Preface


Just a few years ago, there were no legions of deep learning scientists developing intelligent prod- 
ucts and services at major companies and startups. When the youngest among us (the authors) 
entered the field, machine learning did not command headlines in daily newspapers. Our parents 
had no idea what machine learning was, let alone why we might prefer it to a career in medicine or 
law. Machine learning was a forward-looking academic discipline with a narrow set of real-world 
applications. And those applications, e.g., speech recognition and computer vision, required so 
much domain knowledge that they were often regarded as separate areas entirely for which ma- 
chine learning was one small component. Neural networks then, the antecedents of the deep 
learning models that we focus on in this book, were regarded as outmoded tools. 


In just the past five years, deep learning has taken the world by surprise, driving rapid progress 
in fields as diverse as computer vision, natural language processing, automatic speech recogni- 
tion, reinforcement learning, and statistical modeling. With these advances in hand, we can now 
build cars that drive themselves with more autonomy than ever before (and less autonomy than 
some companies might have you believe), smart reply systems that automatically draft the most 
mundane emails, helping people dig out from oppressively large inboxes, and software agents that 
dominate the world's best humans at board games like Go, a feat once thought to be decades away. 
Already, these tools exert ever-wider impacts on industry and society, changing the way movies
are made and diseases are diagnosed, and playing a growing role in basic sciences, from astrophysics
to biology.


About This Book 


This book represents our attempt to make deep learning approachable, teaching you the concepts, 
the context, and the code. 


One Medium Combining Code, Math, and HTML 


For any computing technology to reach its full impact, it must be well-understood, well- 
documented, and supported by mature, well-maintained tools. The key ideas should be clearly
distilled, minimizing the onboarding time needed to bring new practitioners up to date. Mature
libraries should automate common tasks, and exemplar code should make it easy for practitioners 
to modify, apply, and extend common applications to suit their needs. Take dynamic web appli- 
cations as an example. Despite a large number of companies, like Amazon, developing successful 
database-driven web applications in the 1990s, the potential of this technology to aid creative en- 
trepreneurs has been realized to a far greater degree in the past ten years, owing in part to the 
development of powerful, well-documented frameworks. 





Testing the potential of deep learning presents unique challenges because any single application 
brings together various disciplines. Applying deep learning requires simultaneously understand- 
ing (i) the motivations for casting a problem in a particular way; (ii) the mathematics of a given 
modeling approach; (iii) the optimization algorithms for fitting the models to data; and (iv) the 
engineering required to train models efficiently, navigating the pitfalls of numerical computing 
and getting the most out of available hardware. Teaching the critical thinking skills required
to formulate problems, the mathematics to solve them, and the software tools to implement those
solutions all in one place presents formidable challenges. Our goal in this book is to present a
unified resource to bring would-be practitioners up to speed. 


At the time we started this book project, there were no resources that simultaneously (i) were 
up to date; (ii) covered the full breadth of modern machine learning with substantial technical 
depth; and (iii) interleaved exposition of the quality one expects from an engaging textbook with 
the clean runnable code that one expects to find in hands-on tutorials. We found plenty of code 
examples for how to use a given deep learning framework (e.g., how to do basic numerical computing
with matrices in TensorFlow) or for implementing particular techniques (e.g., code snippets
for LeNet, AlexNet, ResNets, etc.) scattered across various blog posts and GitHub repositories.
However, these examples typically focused on how to implement a given approach, but left out the
discussion of why certain algorithmic decisions are made. While some interactive resources have
popped up sporadically to address a particular topic, e.g., the engaging blog posts published on
the website Distill³, or personal blogs, they only covered selected topics in deep learning, and
often lacked associated code. On the other hand, while several textbooks have emerged, most no- 
tably (Goodfellow et al., 2016), which offers a comprehensive survey of the concepts behind deep 
learning, these resources do not marry the descriptions to realizations of the concepts in code, 
sometimes leaving readers clueless as to how to implement them. Moreover, too many resources 
are hidden behind the paywalls of commercial course providers. 


We set out to create a resource that could (i) be freely available for everyone; (ii) offer sufficient 
technical depth to provide a starting point on the path to actually becoming an applied machine 
learning scientist; (iii) include runnable code, showing readers how to solve problems in practice; 
(iv) allow for rapid updates, both by us and also by the community at large; and (v) be complemented
by a forum⁴ for interactive discussion of technical details and to answer questions.


These goals were often in conflict. Equations, theorems, and citations are best managed and laid 
out in LaTeX. Code is best described in Python. And webpages are native in HTML and JavaScript. 
Furthermore, we want the content to be accessible as executable code, as a physical book,
as a downloadable PDF, and on the Internet as a website. At present there exist no tools and no
workflow perfectly suited to these demands, so we had to assemble our own. We describe our 
approach in detail in Section 19.6. We settled on GitHub to share the source and to allow for edits, 
Jupyter notebooks for mixing code, equations and text, Sphinx as a rendering engine to generate 
multiple outputs, and Discourse for the forum. While our system is not yet perfect, these choices 
provide a good compromise among the competing concerns. We believe that this might be the 
first book published using such an integrated workflow. 





3 http://distill.pub 
4 http://discuss.d2l.ai
Learning by Doing


Many textbooks teach a series of topics, each in exhaustive detail. For example, Chris Bishop's
excellent textbook (Bishop, 2006) teaches each topic so thoroughly that getting to the chapter on
linear regression requires a non-trivial amount of work. While experts love this book precisely
for its thoroughness, for beginners this property limits its usefulness as an introductory text.


In this book, we will teach most concepts just in time. In other words, you will learn concepts at the 
very moment that they are needed to accomplish some practical end. While we take some time at 
the outset to teach fundamental preliminaries, like linear algebra and probability, we want you to 
taste the satisfaction of training your first model before worrying about more esoteric probability 
distributions. 


Aside from a few preliminary notebooks that provide a crash course in the basic mathematical
background, each subsequent chapter both introduces a reasonable number of new concepts and
provides single self-contained working examples using real datasets. This presents an organizational
challenge. Some models might logically be grouped together in a single notebook. And
some ideas might be best taught by executing several models in succession. On the other hand, 
there is a big advantage to adhering to a policy of one working example, one notebook: This makes 
it as easy as possible for you to start your own research projects by leveraging our code. Just copy 
a notebook and start modifying it. 


We will interleave the runnable code with background material as needed. In general, we will
err on the side of making tools available before explaining them fully (and we will follow up
by explaining the background later). For instance, we might use stochastic gradient descent before 
fully explaining why it is useful or why it works. This helps to give practitioners the necessary 
ammunition to solve problems quickly, at the expense of requiring the reader to trust us with 
some curatorial decisions. 


This book will teach deep learning concepts from scratch. Sometimes, we want to delve into fine 
details about the models that would typically be hidden from the user by deep learning frame- 
works' advanced abstractions. This comes up especially in the basic tutorials, where we want you 
to understand everything that happens in a given layer or optimizer. In these cases, we will often 
present two versions of the example: one where we implement everything from scratch, relying 
only on the NumPy interface and automatic differentiation, and another, more practical exam- 
ple, where we write succinct code using high-level APIs of deep learning frameworks. Once we 
have taught you how some component works, we can just use the high-level APIs in subsequent 
tutorials. 
Content and Structure


The book can be roughly divided into three parts, which are presented by different colors in Fig. 1:












[Figure: diagram of the book's chapters. Boxes shown: 1. Introduction; 3. Linear Neural Networks; 4. Multilayer Perceptrons; 5. Deep Learning Computation; 6. Convolutional Neural Networks; 7. Modern Convolutional Neural Networks; 8. Recurrent Neural Networks; 9. Modern Recurrent Neural Networks; 10. Attention Mechanisms]


Fig. 1: Book structure


The first part covers basics and preliminaries. Chapter 1 offers an introduction to deep learn- 
ing. Then, in Chapter 2, we quickly bring you up to speed on the prerequisites required for 
hands-on deep learning, such as how to store and manipulate data, and how to apply various
numerical operations based on basic concepts from linear algebra, calculus, and probabil- 
ity. Chapter 3 and Chapter 4 cover the most basic concepts and techniques of deep learning, 
such as linear regression, multilayer perceptrons and regularization. 


The next five chapters focus on modern deep learning techniques. Chapter 5 describes the 
various key components of deep learning calculations and lays the groundwork for us to 
subsequently implement more complex models. Next, in Chapter 6 and Chapter 7, we intro- 
duce convolutional neural networks (CNNs), powerful tools that form the backbone of most 
modern computer vision systems. Subsequently, in Chapter 8 and Chapter 9, we introduce 
recurrent neural networks (RNNs), models that exploit temporal or sequential structure in 
data, and are commonly used for natural language processing and time series prediction. 
In Chapter 10, we introduce a new class of models that employ a technique called attention
mechanisms, which have recently begun to displace RNNs in natural language processing.
These sections will get you up to speed on the basic tools behind most modern applications 
of deep learning. 


Part three discusses scalability, efficiency, and applications. First, in Chapter 11, we discuss
several common optimization algorithms used to train deep learning models. The next
chapter, Chapter 12, examines several key factors that influence the computational performance
of your deep learning code. In Chapter 13, we illustrate major applications of deep
learning in computer vision. In Chapter 14 and Chapter 15, we show how to pretrain lan- 
guage representation models and apply them to natural language processing tasks. 


Code 


Most sections of this book feature executable code because of our belief in the importance of an 
interactive learning experience in deep learning. At present, certain intuitions can only be devel- 
oped through trial and error, tweaking the code in small ways and observing the results. Ideally, 
an elegant mathematical theory might tell us precisely how to tweak our code to achieve a desired 
result. Unfortunately, at present, such elegant theories elude us. Despite our best attempts, for- 
mal explanations for various techniques are still lacking, both because the mathematics to char- 
acterize these models can be so difficult and also because serious inquiry on these topics has only 
just recently kicked into high gear. We are hopeful that as the theory of deep learning progresses, 
future editions of this book will be able to provide insights in places the present edition cannot. 


At times, to avoid unnecessary repetition, we encapsulate the frequently imported and referred-to
functions, classes, etc. in this book in the d2l package. For any block such as a function, a class,
or multiple imports to be saved in the package, we will mark it with #@save. We offer a detailed
overview of these functions and classes in Section 19.7. The d2l package is lightweight and only
requires the following packages and modules as dependencies:


#@save
import collections
from collections import defaultdict
from IPython import display
import math
from matplotlib import pyplot as plt
import os
import pandas as pd
import random
import re
import shutil
import sys
import tarfile
import time
import requests
import zipfile
import hashlib
d2l = sys.modules[__name__]


Most of the code in this book is based on Apache MXNet. MXNet is an open-source framework for 
deep learning and the preferred choice of AWS (Amazon Web Services), as well as many colleges 
and companies. All of the code in this book has passed tests under the newest MXNet version.
However, due to the rapid development of deep learning, some code in the print edition may not
work properly in future versions of MXNet, although we plan to keep the online version up to
date. In case you encounter any such problems, please consult Installation (page 9) to update your
code and runtime environment.


Here is how we import modules from MXNet. 
#@save
from mxnet import autograd, context, gluon, image, init, np, npx 
from mxnet.gluon import nn, rnn 


Target Audience 


This book is for students (undergraduate or graduate), engineers, and researchers who seek a
solid grasp of the practical techniques of deep learning. Because we explain every concept from 
scratch, no previous background in deep learning or machine learning is required. Fully explain- 
ing the methods of deep learning requires some mathematics and programming, but we will only 
assume that you come in with some basics, including (the very basics of) linear algebra, calcu- 
lus, probability, and Python programming. Moreover, in the Appendix, we provide a refresher 
on most of the mathematics covered in this book. Most of the time, we will prioritize intuition 
and ideas over mathematical rigor. There are many terrific books which can lead the interested 
reader further. For instance, Linear Analysis by Bela Bollobas (Bollobas, 1999) covers linear alge- 
bra and functional analysis in great depth. All of Statistics (Wasserman, 2013) is a terrific guide to 
statistics. And if you have not used Python before, you may want to peruse this Python tutorial.


Forum 


Associated with this book, we have launched a discussion forum, located at discuss.d2l.ai. When
you have questions on any section of the book, you can find the associated discussion page link at 
the end of each chapter. 
Summary


Deep learning has revolutionized pattern recognition, introducing technology that now
powers a wide range of applications, including computer vision, natural language processing,
and automatic speech recognition.


To successfully apply deep learning, you must understand how to cast a problem, the math- 
ematics of modeling, the algorithms for fitting your models to data, and the engineering 
techniques to implement it all. 


This book presents a comprehensive resource, including prose, figures, mathematics, and 
code, all in one place. 


To answer questions related to this book, visit our forum at https://discuss.d2l.ai/.


All notebooks are available for download on GitHub. 


Exercises 


1. Register an account on the discussion forum of this book, discuss.d2l.ai.
2. Install Python on your computer. 


3. Follow the links at the bottom of the section to the forum, where you will be able to seek out 
help and discuss the book and find answers to your questions by engaging the authors and 
broader community. 


Discussions
Installation


In order to get you up and running for a hands-on learning experience, we need to set you up with an
environment for running Python, Jupyter notebooks, the relevant libraries, and the code needed 
to run the book itself. 


Installing Miniconda 


The simplest way to get going will be to install Miniconda¹⁰. The Python 3.x version is required.
You can skip the following steps if conda has already been installed. Download the corresponding 
Miniconda sh file from the website and then execute the installation from the command line using 
sh <FILENAME> -b. For macOS users: 


# The file name is subject to changes 
sh Miniconda3-latest-MacOSX-x86_64.sh -b 


For Linux users: 


# The file name is subject to changes 
sh Miniconda3-latest-Linux-x86_64.sh -b 


Next, initialize the shell so we can run conda directly. 


~/miniconda3/bin/conda init 


Now close and re-open your current shell. You should be able to create a new environment as
follows:


conda create --name d2l python=3.8 -y





10 https://conda.io/en/latest/miniconda.html 





Downloading the D2L Notebooks 


Next, we need to download the code of this book. You can click the “All Notebooks” tab on the top 
of any HTML page to download and unzip the code. Alternatively, if you have unzip (otherwise 
run sudo apt install unzip) available: 


mkdir d2l-en && cd d2l-en
curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
unzip d2l-en.zip && rm d2l-en.zip


Now we will want to activate the d2l environment.


conda activate d2l


Installing the Framework and the d2l Package


Before installing the deep learning framework, please first check whether or not you have proper 
GPUs on your machine (the GPUs that power the display on a standard laptop do not count for our 
purposes). If you are installing on a GPU server, proceed to GPU Support (page 11) for instructions 
to install a GPU-supported version. 


Otherwise, you can install the CPU version as follows. That will be more than enough horsepower 
to get you through the first few chapters but you will want to access GPUs before running larger 
models. 


pip install mxnet==1.7.0.post1


We also install the d2l package that encapsulates frequently used functions and classes in this
book.


# -U: Upgrade all packages to the newest available version
pip install -U d2l


Once they are installed, we now open the Jupyter notebook by running: 


jupyter notebook 


At this point, you can open http://localhost:8888 (it usually opens automatically) in your Web 
browser. Then we can run the code for each section of the book. Please always execute conda activate d2l
to activate the runtime environment before running the code of the book or updating
the deep learning framework or the d2l package. To exit the environment, run conda deactivate.
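
Before opening a notebook, it can help to confirm that the activated environment actually contains the packages installed above. The following is only a sketch under our own assumptions: the helper name missing_packages is hypothetical, and it relies solely on the standard library's importlib.util.find_spec, which returns None when a top-level package cannot be found.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two packages this section installs; run this inside the activated environment.
required = ["mxnet", "d2l"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If the list is not empty, re-running the corresponding pip install command inside the activated environment is the first thing to try.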
GPU Support


By default, MXNet is installed without GPU support to ensure that it will run on any computer 
(including most laptops). Part of this book requires or recommends running with a GPU. If your
computer has NVIDIA graphics cards and CUDA¹¹ installed, then you should install a GPU-enabled
version. If you have installed the CPU-only version, you may need to remove it first by
running: 


pip uninstall mxnet 


Then we need to find the CUDA version you installed. You may check it through nvcc --version
or cat /usr/local/cuda/version.txt. Assuming that you have installed CUDA 10.1, you can
install with the following command:


# For Windows users
pip install mxnet-cu101==1.7.0 -f https://dist.mxnet.io/python


# For Linux and macOS users
pip install mxnet-cu101==1.7.0


You may change the last digits according to your CUDA version, e.g., cu100 for CUDA 10.0 and cu90
for CUDA 9.0.
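
The naming rule above (drop the dot from the CUDA version and append it to mxnet-cu) can be written down as a tiny helper. This is a sketch: the function name mxnet_gpu_package is our own, and it only builds the package name from the pattern shown in this section. It does not check that a package with that name is actually published for your CUDA version.

```python
def mxnet_gpu_package(cuda_version):
    """Map a CUDA version string such as "10.1" to a name such as "mxnet-cu101"."""
    major, minor = cuda_version.split(".")[:2]
    return f"mxnet-cu{major}{minor}"


for version in ["10.1", "10.0", "9.0"]:
    print(version, "->", mxnet_gpu_package(version))
```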


Exercises 


1. Download the code for the book and install the runtime environment. 


Discussions¹²


11 https://developer.nvidia.com/cuda-downloads
12 https://discuss.d2l.ai/t/23
Notation


The notation used throughout this book is summarized below. 


Numbers 


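• $x$: A scalar
• $\mathbf{x}$: A vector
• $\mathbf{X}$: A matrix
• $\mathsf{X}$: A tensor
• $\mathbf{I}$: An identity matrix
• $x_i$, $[\mathbf{x}]_i$: The $i^\mathrm{th}$ element of vector $\mathbf{x}$
• $x_{ij}$, $x_{i,j}$, $[\mathbf{X}]_{ij}$, $[\mathbf{X}]_{i,j}$: The element of matrix $\mathbf{X}$ at row $i$ and column $j$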


Set Theory 


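• $\mathcal{X}$: A set
• $\mathbb{Z}$: The set of integers
• $\mathbb{Z}^+$: The set of positive integers
• $\mathbb{R}$: The set of real numbers
• $\mathbb{R}^n$: The set of $n$-dimensional vectors of real numbers
• $\mathbb{R}^{a \times b}$: The set of matrices of real numbers with $a$ rows and $b$ columns
• $|\mathcal{X}|$: Cardinality (number of elements) of set $\mathcal{X}$
• $\mathcal{A} \cup \mathcal{B}$: Union of sets $\mathcal{A}$ and $\mathcal{B}$
• $\mathcal{A} \cap \mathcal{B}$: Intersection of sets $\mathcal{A}$ and $\mathcal{B}$
• $\mathcal{A} \setminus \mathcal{B}$: Subtraction of set $\mathcal{B}$ from set $\mathcal{A}$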
Functions and Operators


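• $f(\cdot)$: A function
• $\log(\cdot)$: The natural logarithm
• $\exp(\cdot)$: The exponential function
• $\mathbf{1}_{\mathcal{X}}$: The indicator function
• $(\cdot)^\top$: Transpose of a vector or a matrix
• $\mathbf{X}^{-1}$: Inverse of matrix $\mathbf{X}$
• $\odot$: Hadamard (elementwise) product
• $[\cdot, \cdot]$: Concatenation
• $|\mathcal{X}|$: Cardinality of set $\mathcal{X}$
• $\|\cdot\|_p$: $L_p$ norm
• $\|\cdot\|$: $L_2$ norm
• $\langle \mathbf{x}, \mathbf{y} \rangle$: Dot product of vectors $\mathbf{x}$ and $\mathbf{y}$
• $\sum$: Series addition
• $\prod$: Series multiplication
• $\stackrel{\mathrm{def}}{=}$: Definition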
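
To make a few of these operators concrete, here is a small sketch in plain Python that represents vectors as lists and matrices as lists of rows. The helper names (transpose, hadamard, dot, lp_norm) are our own and are not part of any library. In practice you would use an array library rather than nested lists.

```python
def transpose(X):
    """Transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*X)]


def hadamard(X, Y):
    """Elementwise (Hadamard) product of two matrices of the same shape."""
    return [[a * b for a, b in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]


def dot(x, y):
    """Dot product of two vectors of the same length."""
    return sum(a * b for a, b in zip(x, y))


def lp_norm(x, p=2):
    """L_p norm of a vector; p=2 gives the L_2 norm."""
    return sum(abs(a) ** p for a in x) ** (1 / p)


X = [[1, 2], [3, 4]]
Y = [[10, 20], [30, 40]]
x, y = [3, 4], [1, 2]

print(transpose(X))     # [[1, 3], [2, 4]]
print(hadamard(X, Y))   # [[10, 40], [90, 160]]
print(dot(x, y))        # 11
print(lp_norm(x))       # 5.0
print(lp_norm(x, p=1))  # 7.0
print(x + y)            # concatenation of two lists: [3, 4, 1, 2]
```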


Calculus 


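• $\frac{dy}{dx}$: Derivative of $y$ with respect to $x$
• $\frac{\partial y}{\partial x}$: Partial derivative of $y$ with respect to $x$
• $\nabla_{\mathbf{x}} y$: Gradient of $y$ with respect to $\mathbf{x}$
• $\int_a^b f(x) \; dx$: Definite integral of $f$ from $a$ to $b$ with respect to $x$
• $\int f(x) \; dx$: Indefinite integral of $f$ with respect to $x$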
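
As a rough numerical illustration of the derivative and the definite integral, the sketch below approximates both for $f(x) = x^2$. The techniques are our own choice for illustration (a central difference and the trapezoidal rule), and the function names are hypothetical.

```python
def derivative(f, x, h=1e-5):
    """Central-difference approximation of df/dx at the point x."""
    return (f(x + h) - f(x - h)) / (2 * h)


def definite_integral(f, a, b, n=1000):
    """Trapezoidal-rule approximation of the integral of f from a to b."""
    width = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * width)
    return total * width


def f(x):
    return x ** 2


print(derivative(f, 3.0))              # close to 6, since dy/dx = 2x
print(definite_integral(f, 0.0, 3.0))  # close to 9, since the integral is x^3 / 3
```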


Probability and Information Theory 


• $P(\cdot)$: Probability distribution
• $z \sim P$: Random variable $z$ has probability distribution $P$
• $P(X \mid Y)$: Conditional probability of $X \mid Y$
• $p(x)$: Probability density function
• $E_{x}[f(x)]$: Expectation of $f$ with respect to $x$
• $X \perp Y$: Random variables $X$ and $Y$ are independent
• $X \perp Y \mid Z$: Random variables $X$ and $Y$ are conditionally independent given random variable $Z$
• $\mathrm{Var}(X)$: Variance of random variable $X$
• $\sigma_X$: Standard deviation of random variable $X$
• $\mathrm{Cov}(X, Y)$: Covariance of random variables $X$ and $Y$
• $\rho(X, Y)$: Correlation of random variables $X$ and $Y$
• $H(X)$: Entropy of random variable $X$
• $D_{\mathrm{KL}}(P \| Q)$: KL-divergence of distributions $P$ and $Q$
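
For discrete distributions, entropy and KL-divergence can be computed directly from their standard definitions. The sketch below makes two assumptions of its own: it uses the natural logarithm (so results are in nats; another base only rescales them), and it represents a distribution as a list of probabilities. The function names are hypothetical.

```python
import math


def entropy(p):
    """H(X) = -sum_i p_i log p_i for a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.5, 0.5]
Q = [0.9, 0.1]

print(entropy(P))           # equals log(2), about 0.693
print(kl_divergence(P, P))  # 0.0: a distribution has zero divergence from itself
print(kl_divergence(P, Q))  # positive, and in general different from kl_divergence(Q, P)
```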


Complexity 


• $\mathcal{O}$: Big O notation


Discussions
1 Introduction


Until recently, nearly every computer program that we interact with daily was coded by software
developers from first principles. Say that we wanted to write an application to manage an e-commerce
platform. After huddling around a whiteboard for a few hours to ponder the problem,
we would come up with the broad strokes of a working solution that might look
something like this: (i) users interact with the application through an interface running in a web
browser or mobile application; (ii) our application interacts with a commercial-grade database
engine to keep track of each user's state and maintain records of historical transactions; and (iii)
at the heart of our application, the business logic (you might say, the brains) of our application spells
out in methodical detail the appropriate action that our program should take in every conceivable
circumstance.


To build the brains of our application, we would have to step through every possible corner case 
that we anticipate encountering, devising appropriate rules. Each time a customer clicks to add 
an item to their shopping cart, we add an entry to the shopping cart database table, associating 
that user's ID with the requested product's ID. While few developers ever get it completely right 
the first time (it might take some test runs to work out the kinks), for the most part, we could write 
such a program from first principles and confidently launch it before ever seeing a real customer. 
Our ability to design automated systems from first principles that drive functioning products and 
systems, often in novel situations, is a remarkable cognitive feat. And when you are able to devise 
solutions that work 100% of the time, you should not be using machine learning. 
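
The shopping cart rule described above is the kind of logic that can be written from first principles. Here is a minimal sketch: an in-memory dictionary stands in for the shopping cart database table, and the names cart_table and add_to_cart, as well as the example IDs, are purely illustrative.

```python
from collections import defaultdict

# Maps a user ID to the list of product IDs in that user's cart.
# This stands in for the shopping cart database table.
cart_table = defaultdict(list)


def add_to_cart(user_id, product_id):
    """Business rule: clicking "add to cart" associates the user ID with the product ID."""
    cart_table[user_id].append(product_id)
    return cart_table[user_id]


add_to_cart("user-42", "sku-1001")
add_to_cart("user-42", "sku-2002")
print(cart_table["user-42"])  # ['sku-1001', 'sku-2002']
```

Nothing here is learned from data: the behavior is fully specified by the rule we wrote down.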


Fortunately for the growing community of machine learning scientists, many tasks that we would 
like to automate do not bend so easily to human ingenuity. Imagine huddling around the white- 
board with the smartest minds you know, but this time you are tackling one of the following prob- 
lems: 


• Write a program that predicts tomorrow's weather given geographic information, satellite
images, and a trailing window of past weather.

• Write a program that takes in a question, expressed in free-form text, and answers it correctly.

• Write a program that given an image can identify all the people it contains, drawing outlines
around each.

• Write a program that presents users with products that they are likely to enjoy but unlikely,
in the natural course of browsing, to encounter.


In each of these cases, even elite programmers are incapable of coding up solutions from scratch. 
The reasons for this can vary. Sometimes the program that we are looking for follows a pattern 
that changes over time, and we need our programs to adapt. In other cases, the relationship (say,
between pixels and abstract categories) may be too complicated, requiring thousands or millions
of computations that are beyond our conscious understanding even if our eyes manage the task
effortlessly. Machine learning is the study of powerful techniques that can learn from experience.
As a machine learning algorithm accumulates more experience, typically in the form of observational
data or interactions with an environment, its performance improves. Contrast this with
our deterministic e-commerce platform, which performs according to the same business logic, 
no matter how much experience accrues, until the developers themselves learn and decide that it 
is time to update the software. In this book, we will teach you the fundamentals of machine learn- 
ing, and focus in particular on deep learning, a powerful set of techniques driving innovations in 
areas as diverse as computer vision, natural language processing, healthcare, and genomics. 


1.1 A Motivating Example 


Before beginning writing, the authors of this book, like much of the work force, had to become 
caffeinated. We hopped in the car and started driving. Using an iPhone, Alex called out “Hey Siri”, 
awakening the phone’s voice recognition system. Then Mu commanded “directions to Blue Bottle 
coffee shop”. The phone quickly displayed the transcription of his command. It also recognized 
that we were asking for directions and launched the Maps application (app) to fulfill our request. 
Once launched, the Maps app identified a number of routes. Next to each route, the phone dis- 
played a predicted transit time. While we fabricated this story for pedagogical convenience, it 
demonstrates that in the span of just a few seconds, our everyday interactions with a smart phone 
can engage several machine learning models. 


Imagine just writing a program to respond to a wake word such as “Alexa”, “OK Google”, and “Hey 
Siri”. Try coding it up in a room by yourself with nothing but a computer and a code editor, as 
illustrated in Fig. 1.1.1. How would you write such a program from first principles? Think about 
it... the problem is hard. Every second, the microphone will collect roughly 44000 samples. Each 
sample is a measurement of the amplitude of the sound wave. What rule could map reliably from 
a snippet of raw audio to confident predictions {yes, no} on whether the snippet contains the wake
word? If you are stuck, do not worry. We do not know how to write such a program from scratch 
either. That is why we use machine learning. 
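
To get a feel for the raw input, the sketch below synthesizes one second of audio as a list of amplitude samples and applies a deliberately naive hand-written rule. Everything here is an assumption made for illustration: the sample rate is the rough figure quoted above, the pure tone stands in for real microphone data, and the loudness threshold is arbitrary.

```python
import math

SAMPLE_RATE = 44_000  # samples per second, the rough figure quoted in the text


def tone(frequency_hz, amplitude, seconds=1.0):
    """Synthesize a pure tone as a list of amplitude samples."""
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * frequency_hz * t / SAMPLE_RATE)
            for t in range(n)]


def naive_wake_word_rule(samples, threshold=0.1):
    """Hand-written rule: answer "yes" whenever the snippet is loud enough."""
    loudness = sum(abs(s) for s in samples) / len(samples)
    return "yes" if loudness > threshold else "no"


quiet = tone(440, amplitude=0.01)
loud = tone(440, amplitude=0.8)
print(len(loud))                    # 44000 numbers for a single second of audio
print(naive_wake_word_rule(quiet))  # no
print(naive_wake_word_rule(loud))   # yes
```

The rule fires on any loud sound, not just the wake word, which is exactly why a hand-written mapping from tens of thousands of amplitudes to a yes/no answer is so hard to get right.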




Fig. 1.1.1: Identify a wake word. 


Here is the trick. Often, even when we do not know how to tell a computer explicitly how to map 
from inputs to outputs, we are nonetheless capable of performing the cognitive feat ourselves. In 
other words, even if you do not know how to program a computer to recognize the word “Alexa”, 
you yourself are able to recognize it. Armed with this ability, we can collect a huge dataset con- 
taining examples of audio and label those that do and that do not contain the wake word. In the 
machine learning approach, we do not attempt to design a system explicitly to recognize wake 
words. Instead, we define a flexible program whose behavior is determined by a number of pa- 
rameters. Then we use the dataset to determine the best possible set of parameters, those that 
improve the performance of our program with respect to some measure of performance on the 
task of interest. 


You can think of the parameters as knobs that we can turn, manipulating the behavior of the 
program. Fixing the parameters, we call the program a model. The set of all distinct programs 
(input-output mappings) that we can produce just by manipulating the parameters is called a family
of models. And the meta-program that uses our dataset to choose the parameters is called a
learning algorithm. 
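
These three terms can be illustrated with a toy sketch. The family of models here is a one-dimensional threshold rule with two parameters, and the learning algorithm simply tries a handful of candidate settings and keeps the most accurate one. That brute-force search is our own simplification for illustration, not how models are trained in the rest of the book, and all names and numbers are hypothetical.

```python
def make_model(weight, bias):
    """Fixing the parameters (the knobs) picks one model out of the family."""
    def model(x):
        return "yes" if weight * x + bias > 0 else "no"
    return model


def learning_algorithm(dataset, candidate_params):
    """Meta-program: use the dataset to choose the best parameter setting."""
    def accuracy(params):
        model = make_model(*params)
        return sum(model(x) == label for x, label in dataset) / len(dataset)
    return max(candidate_params, key=accuracy)


# Toy labeled dataset: a single numeric feature and a yes/no label.
dataset = [(0.2, "no"), (0.4, "no"), (0.6, "yes"), (0.9, "yes")]
candidates = [(1.0, 0.0), (1.0, -0.5), (-1.0, 0.5)]

best = learning_algorithm(dataset, candidates)
model = make_model(*best)
print(best)                            # (1.0, -0.5)
print([model(x) for x, _ in dataset])  # ['no', 'no', 'yes', 'yes']
```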


Before we can go ahead and engage the learning algorithm, we have to define the problem pre- 
cisely, pinning down the exact nature of the inputs and outputs, and choosing an appropriate 
model family. In this case, our model receives a snippet of audio as input, and the model generates
a selection among {yes, no} as output. If all goes according to plan, the model’s guesses will
typically be correct as to whether the snippet contains the wake word. 


If we choose the right family of models, there should exist one setting of the knobs such that the 
model fires “yes” every time it hears the word “Alexa”. Because the exact choice of the wake word 
is arbitrary, we will probably need a model family sufficiently rich that, via another setting of the 
knobs, it could fire “yes” only upon hearing the word “Apricot”. We expect that the same model 
family should be suitable for “Alexa” recognition and “Apricot” recognition because they seem, 
intuitively, to be similar tasks. However, we might need a different family of models entirely if we 
want to deal with fundamentally different inputs or outputs, say if we wanted to map from images 
to captions, or from English sentences to Chinese sentences. 


As you might guess, if we just set all of the knobs randomly, it is unlikely that our model will
recognize “Alexa”, “Apricot”, or any other English word. In machine learning, the learning is the
process by which we discover the right setting of the knobs, coercing the desired behavior from
our model. In other words, we train our model with data. As shown in Fig. 1.1.2, the training 
process usually looks like the following: 


1. Start off with a randomly initialized model that cannot do anything useful.
2. Grab some of your data (e.g., audio snippets and corresponding {yes, no} labels).
3. Tweak the knobs so the model sucks less with respect to those examples.
4. Repeat Steps 2 and 3 until the model is awesome.
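
The four steps above can be sketched as a tiny training loop. This is only a minimal illustration under assumptions that are not in the text: the "model" has a single knob `w`, the data are made-up number pairs generated by y = 3x rather than audio snippets, and the "tweak" is a small gradient step on the squared error.

```python
import random

def train(data, steps=200, lr=0.05, seed=0):
    """Toy training loop for a one-knob model: prediction = w * x."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)       # Step 1: randomly initialized knob
    for _ in range(steps):           # Step 4: repeat Steps 2 and 3
        x, y = rng.choice(data)      # Step 2: grab some of the data
        error = w * x - y            # how wrong the model is on this example
        w -= lr * 2 * error * x      # Step 3: tweak the knob to reduce squared error
    return w

# Hypothetical dataset generated by the rule y = 3 * x
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = train(data)
```

After training, `w` ends up close to 3, the value that generated the data, even though we never wrote that rule into the program.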


[Figure: Design a model; Grab new data; Update the model]


Fig. 1.1.2: A typical training process.


To summarize, rather than code up a wake word recognizer, we code up a program that can learn 
to recognize wake words, if we present it with a large labeled dataset. You can think of this act of 
determining a program's behavior by presenting it with a dataset as programming with data. That 
is to say, we can “program” a cat detector by providing our machine learning system with many 
examples of cats and dogs. This way the detector will eventually learn to emit a very large positive
number if it is a cat, a very large negative number if it is a dog, and something closer to zero if it is
not sure. This barely scratches the surface of what machine learning can do. Deep learning,
which we will explain in greater detail later, is just one among many popular methods for solving
machine learning problems.


1.2 Key Components


In our wake word example, we described a dataset consisting of audio snippets and binary labels, 
and we gave a hand-wavy sense of how we might train a model to approximate a mapping from 
snippets to classifications. This sort of problem, where we try to predict a designated unknown la- 
bel based on known inputs given a dataset consisting of examples for which the labels are known, 
is called supervised learning. This is just one among many kinds of machine learning problems. 
Later we will take a deep dive into different machine learning problems. First, we would like to 
shed more light on some core components that will follow us around, no matter what kind of 
machine learning problem we take on: 


1. The data that we can learn from. 
2. A model of how to transform the data. 
3. An objective function that quantifies how well (or badly) the model is doing.
4. An algorithm to adjust the model’s parameters to optimize the objective function.


1.2.1 Data 


It might go without saying that you cannot do data science without data. We could lose hundreds 
of pages pondering what precisely constitutes data, but for now, we will err on the practical side 
and focus on the key properties to be concerned with. Generally, we are concerned with a col- 
lection of examples. In order to work with data usefully, we typically need to come up with a 
suitable numerical representation. Each example (or data point, data instance, sample) typically 
consists of a set of attributes called features (or covariates), from which the model must make its 
predictions. In the supervised learning problems above, the thing to predict is a special attribute 
that is designated as the label (or target). 


If we were working with image data, each individual photograph might constitute an example, 
each represented by an ordered list of numerical values corresponding to the brightness of each 
pixel. A 200 × 200 color photograph would consist of 200 × 200 × 3 = 120000 numerical values,
corresponding to the brightness of the red, green, and blue channels for each spatial location. 
In another traditional task, we might try to predict whether or not a patient will survive, given a 
standard set of features such as age, vital signs, and diagnoses. 
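
As a quick check of the arithmetic for the photograph, the sketch below builds a made-up all-black 200 × 200 color image as nested Python lists and flattens it into one ordered list of numbers. The nested-list layout `image[row][column] = [red, green, blue]` is an assumption chosen for illustration.

```python
height, width, channels = 200, 200, 3

# Hypothetical image: every pixel is black, stored as [red, green, blue]
image = [[[0, 0, 0] for _ in range(width)] for _ in range(height)]

# Flatten rows, then pixels, then channels into a single feature vector
features = [value for row in image for pixel in row for value in pixel]
num_values = len(features)
```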


When every example is characterized by the same number of numerical values, we say that the 
data consist of fixed-length vectors and we describe the constant length of the vectors as the di- 
mensionality of the data. As you might imagine, fixed-length can be a convenient property. If we 
wanted to train a model to recognize cancer in microscopy images, fixed-length inputs mean we 
have one less thing to worry about. 


However, not all data can easily be represented as fixed-length vectors. While we might expect 
microscope images to come from standard equipment, we cannot expect images mined from the 
Internet to all show up with the same resolution or shape. For images, we might consider crop- 
ping them all to a standard size, but that strategy only gets us so far. We risk losing information 
in the cropped out portions. Moreover, text data resist fixed-length representations even more 
stubbornly. Consider the customer reviews left on e-commerce sites such as Amazon, IMDb, and
TripAdvisor. Some are short: “it stinks!”. Others ramble for pages. One major advantage of deep 
learning over traditional methods is the comparative grace with which modern models can handle 
varying-length data.


Generally, the more data we have, the easier our job becomes. When we have more data, we
can train more powerful models and rely less heavily on pre-conceived assumptions. The regime 
change from (comparatively) small to big data is a major contributor to the success of modern 
deep learning. To drive the point home, many of the most exciting models in deep learning do not 
work without large datasets. Some others work in the small data regime, but are no better than 
traditional approaches. 


Finally, it is not enough to have lots of data and to process it cleverly. We need the right data. If 
the data are full of mistakes, or if the chosen features are not predictive of the target quantity of 
interest, learning is going to fail. The situation is captured well by the cliché: garbage in, garbage 
out. Moreover, poor predictive performance is not the only potential consequence. In sensitive 
applications of machine learning, like predictive policing, resume screening, and risk models 
used for lending, we must be especially alert to the consequences of garbage data. One common 
failure mode occurs in datasets where some groups of people are unrepresented in the training 
data. Imagine applying a skin cancer recognition system in the wild that had never seen black 
skin before. Failure can also occur when the data do not merely under-represent some groups 
but reflect societal prejudices. For example, if past hiring decisions are used to train a predictive 
model that will be used to screen resumes, then machine learning models could inadvertently 
capture and automate historical injustices. Note that this can all happen without the data scientist 
actively conspiring, or even being aware. 


1.2.2 Models 


Most machine learning involves transforming the data in some sense. We might want to build a 
system that ingests photos and predicts smiley-ness. Alternatively, we might want to ingest a set of 
sensor readings and predict how normal vs. anomalous the readings are. By model, we denote the 
computational machinery for ingesting data of one type, and spitting out predictions of a possibly 
different type. In particular, we are interested in statistical models that can be estimated from 
data. While simple models are perfectly capable of addressing appropriately simple problems, 
the problems that we focus on in this book stretch the limits of classical methods. Deep learning 
is differentiated from classical approaches principally by the set of powerful models that it focuses 
on. These models consist of many successive transformations of the data that are chained together
top to bottom, thus the name deep learning. On our way to discussing deep models, we will also 
discuss some more traditional methods. 


1.2.3 Objective Functions 


Earlier, we introduced machine learning as learning from experience. By learning here, we mean 
improving at some task over time. But who is to say what constitutes an improvement? You might 
imagine that we could propose to update our model, and some people might disagree on whether 
the proposed update constituted an improvement or a decline. 


In order to develop a formal mathematical system of learning machines, we need to have formal 
measures of how good (or bad) our models are. In machine learning, and optimization more 
generally, we call these objective functions. By convention, we usually define objective functions 
so that lower is better. This is merely a convention. You can take any function for which higher is 
better, and turn it into a new function that is qualitatively identical but for which lower is better 
by flipping the sign. Because lower is better, these functions are sometimes called loss functions. 


When trying to predict numerical values, the most common loss function is squared error, i.e., the
square of the difference between the prediction and the ground truth. For classification, the most
common objective is to minimize error rate, i.e., the fraction of examples on which our predictions
disagree with the ground truth. Some objectives (e.g., squared error) are easy to optimize.
Others (e.g., error rate) are difficult to optimize directly, owing to non-differentiability or other 
complications. In these cases, it is common to optimize a surrogate objective. 
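
The two objectives named above can be written down in a few lines. This is a minimal sketch; the function names and the sample predictions are made up for illustration.

```python
def squared_error(prediction, target):
    """Square of the difference between a prediction and the ground truth."""
    return (prediction - target) ** 2

def error_rate(predicted_labels, true_labels):
    """Fraction of examples on which predictions disagree with the ground truth."""
    wrong = sum(p != t for p, t in zip(predicted_labels, true_labels))
    return wrong / len(true_labels)

# Hypothetical numerical predictions and targets
mean_squared_error = sum(
    squared_error(p, t) for p, t in zip([2.5, 0.0], [3.0, 1.0])
) / 2

# Hypothetical class predictions and labels: one mistake out of four
err = error_rate(["cat", "dog", "dog", "cat"], ["cat", "dog", "cat", "cat"])
```

Note that nudging a prediction slightly changes the squared error smoothly, whereas the error rate only changes when a predicted label flips. That is one way to see why the error rate is hard to optimize directly.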


Typically, the loss function is defined with respect to the model's parameters and depends upon 
the dataset. We learn the best values of our model's parameters by minimizing the loss incurred 
on a set consisting of some number of examples collected for training. However, doing well on 
the training data does not guarantee that we will do well on unseen data. So we will typically want 
to split the available data into two partitions: the training dataset (or training set, for fitting model 
parameters) and the test dataset (or test set, which is held out for evaluation), reporting how the 
model performs on both of them. You could think of training performance as being like a stu- 
dent's scores on practice exams used to prepare for some real final exam. Even if the results are 
encouraging, that does not guarantee success on the final exam. In other words, the test perfor- 
mance can deviate significantly from the training performance. When a model performs well on 
the training set but fails to generalize to unseen data, we say that it is overfitting. In real-life terms,
this is like flunking the real exam despite doing well on practice exams. 
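
A simple way to obtain the two partitions is to shuffle the available examples and hold out a fraction of them. The sketch below is one possible approach; the function name, the 20% test fraction, and the fixed seed are assumptions made for illustration.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction of them for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # The first n_test shuffled examples are held out; the rest are for fitting
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(10))   # stand-ins for ten labeled examples
train_set, test_set = train_test_split(examples)
```

The model's parameters are then fit on `train_set` only, and `test_set` is consulted purely to measure how well the fitted model does on data it has not seen.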


1.2.4 Optimization Algorithms 


Once we have got some data source and representation, a model, and a well-defined objective function,
we need an algorithm capable of searching for the best possible parameters for minimizing
the loss function. Popular optimization algorithms for deep learning are based on an approach
called gradient descent. In short, at each step, this method checks to see, for each parameter, which
way the training set loss would move if you perturbed that parameter just a small amount. It then
updates the parameter in the direction that reduces the loss, provided the step is small enough.
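
The perturb-and-update idea can be sketched literally: nudge each parameter by a tiny amount, see how the loss moves, and step the other way. The loss function below is a made-up stand-in with a known minimum, and the step size and number of steps are arbitrary assumptions. Practical deep learning libraries compute gradients analytically rather than by perturbation.

```python
def loss(params):
    # Hypothetical loss with its minimum at params = [1.0, -2.0]
    return (params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2

def numerical_gradient(f, params, eps=1e-6):
    """Estimate how the loss moves when each parameter is perturbed slightly."""
    grad = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grad.append((f(bumped) - f(params)) / eps)
    return grad

def gradient_descent(f, params, lr=0.1, steps=100):
    """Repeatedly move every parameter a small step against its gradient."""
    for _ in range(steps):
        grad = numerical_gradient(f, params)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

best = gradient_descent(loss, [0.0, 0.0])
```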


1.3 Kinds of Machine Learning Problems 


The wake word problem in our motivating example is just one among many problems that ma- 
chine learning can tackle. To motivate the reader further and provide us with some common 
language when we talk about more problems throughout the book, in the following we list a sam- 
pling of machine learning problems. We will constantly refer to our aforementioned concepts 
such as data, models, and training techniques. 


1.3.1 Supervised Learning 


Supervised learning addresses the task of predicting labels given input features. Each feature- 
label pair is called an example. Sometimes, when the context is clear, we may use the term exam- 
ples to refer to a collection of inputs, even when the corresponding labels are unknown. Our goal 
is to produce a model that maps any input to a label prediction. 


To ground this description in a concrete example, if we were working in healthcare, then we might
want to predict whether or not a patient would have a heart attack. This observation, “heart attack”
or “no heart attack”, would be our label. The input features might be vital signs such as heart rate,
diastolic blood pressure, and systolic blood pressure.


The supervision comes into play because for choosing the parameters, we (the supervisors) provide
the model with a dataset consisting of labeled examples, where each example is matched with
the ground-truth label. In probabilistic terms, we typically are interested in estimating the conditional
probability of a label given input features. While it is just one among several paradigms
within machine learning, supervised learning accounts for the majority of successful applications 
of machine learning in industry. Partly, that is because many important tasks can be described 
crisply as estimating the probability of something unknown given a particular set of available data: 


• Predict cancer vs. not cancer, given a computed tomography image.
• Predict the correct translation in French, given a sentence in English.
• Predict the price of a stock next month based on this month’s financial reporting data.


Even with the simple description “predicting labels given input features”, supervised learning can
take a great many forms and require a great many modeling decisions, depending on (among
other considerations) the type, size, and number of inputs and outputs. For example, we use
different models for processing sequences of arbitrary lengths and for processing fixed-length vector
representations. We will visit many of these problems in depth throughout this book.


Informally, the learning process looks something like the following. First, grab a big collection of
examples for which the features are known and select from them a random subset, acquiring the
ground-truth labels for each. Sometimes these labels might be available data that have already
been collected (e.g., did a patient die within the following year?) and other times we might need
to employ human annotators to label the data (e.g., assigning images to categories). Together,
these inputs and corresponding labels constitute the training set. We feed the training dataset
into a supervised learning algorithm, a function that takes as input a dataset and outputs another
function: the learned model. Finally, we can feed previously unseen inputs to the learned model,
using its outputs as predictions of the corresponding label. The full process is drawn in Fig. 1.3.1.
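
The idea of a learning algorithm as "a function that takes a dataset and outputs another function" can be made concrete. The sketch below uses a one-nearest-neighbor rule, chosen here purely for brevity and not because the text prescribes it; the temperature readings and labels are made up.

```python
def nearest_neighbor_learner(training_set):
    """A learning algorithm: takes a labeled dataset, returns a learned model."""
    def model(x):
        # Predict the label of the training input closest to x
        _, label = min(training_set, key=lambda pair: abs(pair[0] - x))
        return label
    return model

# Hypothetical training set of (temperature in Celsius, label) pairs
training_set = [(5, "cold"), (12, "cold"), (30, "hot")]

model = nearest_neighbor_learner(training_set)   # training
prediction = model(28)                           # previously unseen input
```

Here `nearest_neighbor_learner` plays the role of the supervised learning algorithm and the returned `model` is the learned function that we apply to new inputs.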


[Figure: Training inputs and Training labels; Supervised learning; Output]


Fig. 1.3.1: Supervised learning.


Regression 


Perhaps the simplest supervised learning task to wrap your head around is regression. Consider, 
for example, a set of data harvested from a database of home sales. We might construct a table, 
where each row corresponds to a different house, and each column corresponds to some relevant 
attribute, such as the square footage of a house, the number of bedrooms, the number of bath- 
rooms, and the number of minutes (walking) to the center of town. In this dataset, each example 
would be a specific house, and the corresponding feature vector would be one row in the table. 
If you live in New York or San Francisco, and you are not the CEO of Amazon, Google, Microsoft, 
or Facebook, the (sq. footage, no. of bedrooms, no. of bathrooms, walking distance) feature vec- 
tor for your home might look something like: [600, 1, 1, 60]. However, if you live in Pittsburgh, it 
might look more like [3000, 4, 3, 10]. Feature vectors like this are essential for most classic machine 
learning algorithms. 


What makes a problem a regression problem is actually the form of the output. Say that you are in the
market for a new home. You might want to estimate the fair market value of a house, given some features like
above. The label, the price of sale, is a numerical value. When labels take on arbitrary numerical
values, we call this a regression problem. Our goal is to produce a model whose predictions closely
approximate the actual label values.


Lots of practical problems are well described as regression problems. Predicting the rating that a
user will assign to a movie can be thought of as a regression problem, and if you designed a great
algorithm to accomplish this feat in 2009, you might have won the 1-million-dollar Netflix prize [14].
Predicting the length of stay for patients in the hospital is also a regression problem. A good rule
of thumb is that any how much? or how many? problem should suggest regression, such as:


• How many hours will this surgery take?
• How much rainfall will this town have in the next six hours?


Even if you have never worked with machine learning before, you have probably worked through 
a regression problem informally. Imagine, for example, that you had your drains repaired and 
that your contractor spent 3 hours removing gunk from your sewage pipes. Then he sent you a 
bill of 350 dollars. Now imagine that your friend hired the same contractor for 2 hours and that he 
received a bill of 250 dollars. If someone then asked you how much to expect on their upcoming 
gunk-removal invoice you might make some reasonable assumptions, such as more hours worked 
costs more dollars. You might also assume that there is some base charge and that the contractor 
then charges per hour. If these assumptions held true, then given these two data examples, you 
could already identify the contractor’s pricing structure: 100 dollars per hour plus 50 dollars to 
show up at your house. If you followed that much then you already understand the high-level idea 
behind linear regression. 
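
The contractor's pricing can be recovered mechanically from the two invoices. The sketch below assumes, as in the text, that the bill is a base charge plus an hourly rate; the function name is made up for illustration.

```python
def fit_line(point_a, point_b):
    """Find the hourly rate and base charge that pass through two (hours, bill) points."""
    (hours_a, bill_a), (hours_b, bill_b) = point_a, point_b
    rate = (bill_b - bill_a) / (hours_b - hours_a)   # dollars per hour
    base = bill_a - rate * hours_a                   # charge for showing up
    return rate, base

# The two invoices from the text: 3 hours for 350 dollars, 2 hours for 250 dollars
rate, base = fit_line((3, 350), (2, 250))

# Prediction for a hypothetical 4-hour job
predicted_bill = rate * 4 + base
```

With only two examples and two unknowns, the line fits the data exactly. With more, noisier invoices, an exact fit is generally impossible and we need a notion of "closest" line.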


In this case, we could produce the parameters that exactly matched the contractor’s prices. Sometimes
this is not possible, e.g., if some of the variance owes to factors beyond the features that we
observe. In these cases, we will try to learn models that minimize the distance between our predictions
and the observed values. In most of our chapters, we will focus on minimizing the squared
error loss function. As we will see later, this loss corresponds to the assumption that our data
were corrupted by Gaussian noise.
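
The link between squared error and Gaussian noise can be previewed with a short derivation. As an assumption for this sketch, write the label as $y = \hat{y}(\mathbf{x}) + \epsilon$ with noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$, where $\hat{y}(\mathbf{x})$ denotes the model's prediction and $\sigma$ is a fixed noise level (this notation is introduced here and is not taken from the text).

```latex
p(y \mid \mathbf{x})
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{\bigl(y - \hat{y}(\mathbf{x})\bigr)^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
-\log p(y \mid \mathbf{x})
  = \frac{\bigl(y - \hat{y}(\mathbf{x})\bigr)^2}{2\sigma^2}
    + \frac{1}{2}\log\bigl(2\pi\sigma^2\bigr)
```

Because the second term does not depend on the prediction, making the observed labels as likely as possible under this noise model amounts to minimizing the squared error.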


Classification 


While regression models are great for addressing how many? questions, lots of problems do not
bend comfortably to this template. For example, a bank wants to add check scanning to its mobile
app. This would involve the customer snapping a photo of a check with their smartphone’s camera,
and the app would need to automatically understand the text seen in the image. To be robust,
it would also need to understand handwritten text, for example by mapping a
handwritten character to one of the known characters. This kind of which one? problem is called
classification. It is treated with a different set of algorithms than those used for regression, although
many techniques will carry over.


In classification, we want our model to look at features, e.g., the pixel values in an image, and then 
predict which category (formally called class), among some discrete set of options, an example 
belongs to. For handwritten digits, we might have ten classes, corresponding to the digits 0 through
9. The simplest form of classification is when there are only two classes, a problem which we call 
binary classification. For example, our dataset could consist of images of animals and our labels 
might be the classes {cat, dog}. While in regression, we sought a regressor to output a numerical 
value, in classification, we seek a classifier, whose output is the predicted class assignment. 





14 https://en.wikipedia.org/wiki/Netflix_Prize


For reasons that we will get into as the book gets more technical, it can be hard to optimize a
model that can only output a hard categorical assignment, e.g., either “cat” or “dog”. In these 
cases, it is usually much easier to instead express our model in the language of probabilities. Given 
features of an example, our model assigns a probability to each possible class. Returning to our 
animal classification example where the classes are {cat, dog}, a classifier might see an image and 
output the probability that the image is a cat as 0.9. We can interpret this number by saying that 
the classifier is 90% sure that the image depicts a cat. The magnitude of the probability for the 
predicted class conveys one notion of uncertainty. It is not the only notion of uncertainty and we 
will discuss others in more advanced chapters. 


When we have more than two possible classes, we call the problem multiclass classification. Com- 
mon examples include hand-written character recognition {0,1,2,...9,a,b,c,...}. While we at- 
tacked regression problems by trying to minimize the squared error loss function, the common 
loss function for classification problems is called cross-entropy, whose name can be demystified 
via an introduction to information theory in subsequent chapters. 
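
For a single example, the cross-entropy loss reduces to the negative logarithm of the probability that the model assigned to the true class. The sketch below illustrates this with made-up probabilities; the function name and the dictionary representation are assumptions for illustration.

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(predicted_probs[true_class])

# Hypothetical classifier output for one image
probs = {"cat": 0.9, "dog": 0.1}

loss_if_cat = cross_entropy(probs, "cat")   # confident and right: small loss
loss_if_dog = cross_entropy(probs, "dog")   # confident and wrong: large loss
```

The loss is small when the model puts high probability on the correct class and grows quickly as that probability approaches zero.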


Note that the most likely class is not necessarily the one that you are going to use for your decision. 
Assume that you find a beautiful mushroom in your backyard as shown in Fig. 1.3.2. 





Fig. 1.3.2: Death cap—do not eat! 


Now, assume that you built a classifier and trained it to predict if a mushroom is poisonous based 
on a photograph. Say our poison-detection classifier outputs that the probability that Fig. 1.3.2 
contains a death cap is 0.2. In other words, the classifier is 80% sure that our mushroom is not 
a death cap. Still, you would have to be a fool to eat it. That is because the certain benefit of a 
delicious dinner is not worth a 20% risk of dying from it. In other words, the effect of the uncertain 
risk outweighs the benefit by far. Thus, we need to compute the expected risk that we incur as the 
loss function, i.e., we need to multiply the probability of the outcome with the benefit (or harm) 
associated with it. In this case, the loss incurred by eating the mushroom can be 0.2 × ∞ + 0.8 × 0 =
∞, whereas the loss of discarding it is 0.2 × 0 + 0.8 × 1 = 0.8. Our caution was justified: as any
mycologist would tell us, the mushroom in Fig. 1.3.2 actually is a death cap. 
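
The expected-risk calculation for the mushroom can be reproduced directly. The probabilities and losses below are the ones from the text (an infinite loss for eating a death cap, a loss of 1 for discarding an edible mushroom); the function and variable names are made up for illustration.

```python
import math

def expected_loss(probabilities, losses):
    """Weight the loss of each outcome by the probability of that outcome."""
    return sum(p * l for p, l in zip(probabilities, losses))

# Outcomes in order: [death cap, not a death cap]
probabilities = [0.2, 0.8]

loss_eat = expected_loss(probabilities, [math.inf, 0.0])
loss_discard = expected_loss(probabilities, [0.0, 1.0])

decision = "discard" if loss_discard < loss_eat else "eat"
```

Even though "not a death cap" is the most likely class, the action with the lower expected loss is to discard the mushroom.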


Classification can get much more complicated than just binary, multiclass, or even multi-label 
classification. For instance, there are some variants of classification for addressing hierarchies. 
Hierarchies assume that there exist some relationships among the many classes. So not all er- 
rors are equal—if we must err, we would prefer to misclassify to a related class rather than to a 
distant class. Usually, this is referred to as hierarchical classification. One early example is due to 
Linnaeus [15], who organized the animals in a hierarchy.





15 https://en.wikipedia.org/wiki/Carl_Linnaeus


In the case of animal classification, it might not be so bad to mistake a poodle (a dog breed) for
a schnauzer (another dog breed), but our model would pay a huge penalty if it confused a poodle 
for a dinosaur. Which hierarchy is relevant might depend on how you plan to use the model. For 
example, rattlesnakes and garter snakes might be close on the phylogenetic tree, but mistaking a
rattler for a garter could be deadly. 


Tagging 


Some classification problems fit neatly into the binary or multiclass classification setups. For ex- 
ample, we could train a normal binary classifier to distinguish cats from dogs. Given the current 
state of computer vision, we can do this easily, with off-the-shelf tools. Nonetheless, no matter 
how accurate our model gets, we might find ourselves in trouble when the classifier encounters 
an image of the Town Musicians of Bremen, a popular German fairy tale featuring four animals, as
shown in Fig. 1.3.3.





Fig. 1.3.3: A donkey, a dog, a cat, and a rooster. 


As you can see, there is a cat in Fig. 1.3.3, and a rooster, a dog, and a donkey, with some trees in 
the background. Depending on what we want to do with our model ultimately, treating this as a 
binary classification problem might not make a lot of sense. Instead, we might want to give the 
model the option of saying the image depicts a cat, a dog, a donkey, and a rooster. 


The problem of learning to predict classes that are not mutually exclusive is called multi-label
classification. Auto-tagging problems are typically best described as multi-label classification
problems. Think of the tags people might apply to posts on a technical blog, e.g., “machine learning”,
“technology”, “gadgets”, “programming languages”, “Linux”, “cloud computing”, “AWS”. A typical
article might have 5-10 tags applied because these concepts are correlated. Posts about “cloud
computing” are likely to mention “AWS” and posts about “machine learning” could also deal with
“programming languages”.
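
One common way to turn per-tag scores into a set of tags is to treat each tag as its own yes/no decision and keep every tag whose probability clears a threshold. The sketch below assumes a model has already produced the (made-up) probabilities; the 0.5 threshold and the function name are illustrative choices.

```python
def predict_tags(tag_probabilities, threshold=0.5):
    """Keep every tag whose probability clears the threshold."""
    return sorted(tag for tag, p in tag_probabilities.items() if p >= threshold)

# Hypothetical per-tag probabilities for one blog post.
# They need not sum to 1 because the tags are not mutually exclusive.
tag_probabilities = {
    "machine learning": 0.92,
    "programming languages": 0.61,
    "AWS": 0.35,
    "Linux": 0.08,
}

tags = predict_tags(tag_probabilities)
```

Unlike multiclass classification, where exactly one class is chosen, this rule can return several tags or none at all.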


We also have to deal with this kind of problem when dealing with the biomedical literature, where
correctly tagging articles is important because it allows researchers to do exhaustive reviews of
the literature. At the National Library of Medicine, a number of professional annotators go over
each article that gets indexed in PubMed to associate it with the relevant terms from MeSH, a
collection of roughly 28000 tags. This is a time-consuming process and the annotators typically
have a one-year lag between archiving and tagging. Machine learning can be used here to provide
provisional tags until each article can have a proper manual review. Indeed, for several years, the
BioASQ organization has hosted competitions [16] to do precisely this.


Search 


Sometimes we do not just want to assign each example to a bucket or to a real value. In the field 
of information retrieval, we want to impose a ranking on a set of items. Take web search for an 
example. The goal is less to determine whether a particular page is relevant for a query than to
determine which of the plethora of search results is most relevant for a particular user. We really care
about the ordering of the relevant search results and our learning algorithm needs to produce
ordered subsets of elements from a larger set. In other words, if we are asked to produce the first
5 letters from the alphabet, there is a difference between returning “A B C D E” and “C A B E D”.
Even if the result set is the same, the ordering within the set matters.


One possible solution to this problem is to first assign to every element in the set a corresponding
relevance score and then to retrieve the top-rated elements. PageRank, the original secret sauce
behind the Google search engine, was an early example of such a scoring system, but it was peculiar
in that it did not depend on the actual query. Instead, the search engine relied on a simple relevance
filter to identify the set of relevant items and then on PageRank to order those results that contained the
query term. Nowadays, search engines use machine learning and behavioral models to obtain
query-dependent relevance scores. There are entire academic conferences devoted to this subject.
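
The score-then-retrieve recipe can be sketched in a few lines. The page names and relevance scores below are made up; in a real system the scores would come from a learned model.

```python
def top_k(items, score, k):
    """Return the k highest-scoring items, ordered from most to least relevant."""
    return sorted(items, key=score, reverse=True)[:k]

# Hypothetical relevance scores for four candidate pages
relevance = {"page_a": 0.2, "page_b": 0.9, "page_c": 0.5, "page_d": 0.7}

results = top_k(relevance, relevance.get, k=3)
```

The output is an ordered list rather than a set, which reflects the point above that the order of the results matters as much as which results are returned.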


Recommender Systems 


Recommender systems are another problem setting that is related to search and ranking. The 
problems are similar insofar as the goal is to display a set of relevant items to the user. The main 
difference is the emphasis on personalization to specific users in the context of recommender sys- 
tems. For instance, for movie recommendations, the results page for a science fiction fan and 
the results page for a connoisseur of Peter Sellers comedies might differ significantly. Similar 
problems pop up in other recommendation settings, e.g., for retail products, music, and news 
recommendation. 


In some cases, customers provide explicit feedback communicating how much they liked a partic- 
ular product (e.g., the product ratings and reviews on Amazon, IMDb, and GoodReads). In some 
other cases, they provide implicit feedback, e.g., by skipping titles on a playlist, which might in- 
dicate dissatisfaction but might just indicate that the song was inappropriate in context. In the 
simplest formulations, these systems are trained to estimate some score, such as an estimated 
rating or the probability of purchase, given a user and an item. 





16 http://bioasq.org/


Given such a model, for any given user, we could retrieve the set of objects with the largest scores,
which could then be recommended to the user. Production systems are considerably more ad- 
vanced and take detailed user activity and item characteristics into account when computing such 
scores. Fig. 1.3.4 is an example of deep learning books recommended by Amazon based on per- 
sonalization algorithms tuned to capture one's preferences. 


Fig. 1.3.4: Deep learning books recommended by Amazon.


Despite their tremendous economic value, recommendation systems naively built on top of predictive
models suffer some serious conceptual flaws. To start, we only observe censored feedback:
users preferentially rate movies that they feel strongly about. For example, on a five-point scale,
you might notice that items receive many five-star and one-star ratings but that there are conspicuously
few three-star ratings. Moreover, current purchase habits are often a result of the recommendation
algorithm currently in place, but learning algorithms do not always take this detail
into account. Thus it is possible for feedback loops to form where a recommender system preferentially
pushes an item that is then taken to be better (due to greater purchases) and in turn is
recommended even more frequently. Many of these problems about how to deal with censoring,
incentives, and feedback loops are important open research questions.







Sequence Learning 


So far, we have looked at problems where we have some fixed number of inputs and produce a 
fixed number of outputs. For example, we considered predicting house prices from a fixed set of 
features: square footage, number of bedrooms, number of bathrooms, walking time to downtown. 
We also discussed mapping from an image (of fixed dimension) to the predicted probabilities that 
it belongs to each of a fixed number of classes, or taking a user ID and a product ID, and predicting 
a star rating. In these cases, once we feed our fixed-length input into the model to generate an 
output, the model immediately forgets what it just saw. 


This might be fine if our inputs truly all have the same dimensions and if successive inputs truly 
have nothing to do with each other. But how would we deal with video snippets? In this case, 
each snippet might consist of a different number of frames. And our guess of what is going on in 
each frame might be much stronger if we take into account the previous or succeeding frames. 
The same goes for language. One popular deep learning problem is machine translation: the task of 
ingesting sentences in some source language and predicting their translation in another language. 


These problems also occur in medicine. We might want a model to monitor patients in the in- 
tensive care unit and to fire off alerts if their risk of death in the next 24 hours exceeds some 
threshold. We definitely would not want this model to throw away everything it knows about the 
patient history each hour and just make its predictions based on the most recent measurements. 


These problems are among the most exciting applications of machine learning and they are in- 
stances of sequence learning. They require a model to either ingest sequences of inputs or to 
emit sequences of outputs (or both). Specifically, sequence to sequence learning considers problems 
where input and output are both variable-length sequences, such as machine translation and 
transcribing text from spoken speech. While it is impossible to consider all types of sequence 
transformations, the following special cases are worth mentioning. 


Tagging and Parsing. This involves annotating a text sequence with attributes. In other words, 
the number of inputs and outputs is essentially the same. For instance, we might want to know 
where the verbs and subjects are. Alternatively, we might want to know which words are the 
named entities. In general, the goal is to decompose and annotate text based on structural and 
grammatical assumptions to get some annotation. This sounds more complex than it actually is. 
Below is a very simple example of annotating a sentence with tags indicating which words refer 
to named entities (tagged as “Ent”). 


Tom has dinner in Washington with Sally 
Ent -   -      -  Ent        -    Ent 
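
The tagging example above can be sketched in a few lines of Python. This is only an illustration of the input/output shape of a tagging problem (one tag per token): the `KNOWN_ENTITIES` set and the `tag_entities` helper are hypothetical stand-ins for a learned model, not something the text prescribes.

```python
# Toy illustration of sequence tagging: the output has exactly one tag per input token.
# KNOWN_ENTITIES is a hypothetical lookup table standing in for a trained tagger.
KNOWN_ENTITIES = {"Tom", "Washington", "Sally"}


def tag_entities(sentence):
    """Return (token, tag) pairs, using 'Ent' for named entities and '-' otherwise."""
    tokens = sentence.split()
    return [(token, "Ent" if token in KNOWN_ENTITIES else "-") for token in tokens]


tagged = tag_entities("Tom has dinner in Washington with Sally")
for token, tag in tagged:
    print(f"{token:<12}{tag}")
```

A real tagger would replace the lookup with a model that also uses the surrounding words, since the same word can be an entity in one sentence and not in another.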


Automatic Speech Recognition. With speech recognition, the input sequence is an audio record- 
ing of a speaker (shown in Fig. 1.3.5), and the output is the textual transcript of what the speaker 
said. The challenge is that there are many more audio frames (sound is typically sampled at 8kHz 
or 16kHz) than text, i.e., there is no 1:1 correspondence between audio and text, since thousands 
of samples may correspond to a single spoken word. These are sequence to sequence learning 
problems where the output is much shorter than the input. 

Fig. 1.3.5: -D-e-e-p- L-ea-r-ni-ng- in an audio recording. 
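
To make the length mismatch in speech recognition concrete, the short calculation below counts the audio samples behind a single spoken word. The half-second word duration is an assumption chosen for illustration; the sampling rates are the ones mentioned in the text.

```python
def num_samples(duration_seconds, sample_rate_hz):
    """Number of audio samples recorded over the given duration."""
    return int(round(duration_seconds * sample_rate_hz))


WORD_DURATION_SECONDS = 0.5  # assumed duration of one spoken word

for sample_rate_hz in (8_000, 16_000):
    samples = num_samples(WORD_DURATION_SECONDS, sample_rate_hz)
    print(f"{sample_rate_hz} Hz: {samples} samples for one word")
```

Under these assumptions a single output token corresponds to several thousand input values, which is why the input and output cannot be aligned one to one.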


Text to Speech. This is the inverse of automatic speech recognition. In other words, the input is 
text and the output is an audio file. In this case, the output is much longer than the input. While 
it is easy for humans to recognize a bad audio file, this is not quite so trivial for computers. 


Machine Translation. Unlike the case of speech recognition, where corresponding inputs and 
outputs occur in the same order (after alignment), in machine translation, order inversion can be 
vital. In other words, while we are still converting one sequence into another, neither the number 
of inputs and outputs nor the order of corresponding data examples are assumed to be the same. 
Consider the following illustrative example of the peculiar tendency of Germans to place the verbs 
at the end of sentences. 


German: Haben Sie sich schon dieses grossartige Lehrwerk angeschaut? 
English: Did you already check out this excellent tutorial? 
Wrong alignment: Did you yourself already this excellent tutorial looked-at? 


Many related problems pop up in other learning tasks. For instance, determining the order in 
which a user reads a webpage is a two-dimensional layout analysis problem. Dialogue problems 
exhibit all kinds of additional complications, where determining what to say next requires taking 
into account real-world knowledge and the prior state of the conversation across long temporal 
distances. These are active areas of research. 


1.3.2 Unsupervised Learning 


All the examples so far were related to supervised learning, i.e., situations where we feed the 
model a giant dataset containing both the features and corresponding label values. You could 
think of the supervised learner as having an extremely specialized job and an extremely banal 
boss. The boss stands over your shoulder and tells you exactly what to do in every situation until 
you learn to map from situations to actions. Working for such a boss sounds pretty lame. On the 
other hand, it is easy to please this boss. You just recognize the pattern as quickly as possible and 
imitate their actions. 


In a completely opposite way, it could be frustrating to work for a boss who has no idea what they 
want you to do. However, if you plan to be a data scientist, you had better get used to it. The boss 
might just hand you a giant dump of data and tell you to do some data science with it! This sounds 
vague because it is. We call this class of problems unsupervised learning, and the type and number 
of questions we could ask is limited only by our creativity. We will address unsupervised learning 
techniques in later chapters. To whet your appetite for now, we describe a few of the 
questions you might ask. 


* Can we find a small number of prototypes that accurately summarize the data? Given a 
set of photos, can we group them into landscape photos, pictures of dogs, babies, cats, and 
mountain peaks? Likewise, given a collection of users' browsing activities, can we group 
them into users with similar behavior? This problem is typically known as clustering. 


* Can we find a small number of parameters that accurately capture the relevant properties of 
the data? The trajectories of a ball are quite well described by velocity, diameter, and mass 
of the ball. Tailors have developed a small number of parameters that describe human body 
shape fairly accurately for the purpose of fitting clothes. These problems are referred to as 
subspace estimation. If the dependence is linear, it is called principal component analysis. 


* Is there a representation of (arbitrarily structured) objects in Euclidean space such that symbolic 
properties can be well matched? This can be used to describe entities and their relations, 
such as “Rome” − “Italy” + “France” = “Paris”. 


* Is there a description of the root causes of much of the data that we observe? For instance, 
if we have demographic data about house prices, pollution, crime, location, education, and 
salaries, can we discover how they are related simply based on empirical data? The fields 
concerned with causality and probabilistic graphical models address this problem. 


* Another important and exciting recent development in unsupervised learning is the advent 
of generative adversarial networks. These give us a procedural way to synthesize data, even 
complicated structured data like images and audio. The underlying statistical mechanisms 
are tests to check whether real and fake data are the same. 
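
As a small illustration of the question about representations in Euclidean space, the sketch below performs the “Rome” − “Italy” + “France” arithmetic on hand-made vectors. The two-dimensional `EMBEDDINGS` table is entirely hypothetical (one axis loosely encodes the country, the other whether the word is a capital); real embeddings are learned from data and have many more dimensions.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "which country", axis 1 ~ "is a capital city".
EMBEDDINGS = {
    "Italy": (1.0, 0.0),
    "Rome": (1.0, 1.0),
    "France": (2.0, 0.0),
    "Paris": (2.0, 1.0),
    "Germany": (3.0, 0.0),
    "Berlin": (3.0, 1.0),
}


def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = tuple(
        x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    )
    # Exclude the query words themselves, as is common in analogy evaluations.
    candidates = [word for word in EMBEDDINGS if word not in (a, b, c)]
    return min(candidates, key=lambda word: math.dist(EMBEDDINGS[word], target))


print(analogy("Rome", "Italy", "France"))
```

With these toy vectors the nearest neighbor of the result is “Paris”; whether learned embeddings behave this cleanly depends on the data and the training method.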


1.3.3 Interacting with an Environment 


So far, we have not discussed where data actually come from, or what actually happens when a 
machine learning model generates an output. That is because supervised learning and unsuper- 
vised learning do not address these issues in a very sophisticated way. In either case, we grab a big 
pile of data upfront, then set our pattern recognition machines in motion without ever interacting 
with the environment again. Because all of the learning takes place after the algorithm is disconnected 
from the environment, this is sometimes called offline learning. For supervised learning, 
the process of collecting data from an environment looks like Fig. 1.3.6. 


Fig. 1.3.6: Collecting data for supervised learning from an environment. 





This simplicity of offline learning has its charms. The upside is that we can worry about pattern 
recognition in isolation, without any distraction from these other problems. But the downside 
is that the problem formulation is quite limiting. If you are more ambitious, or if you grew up 
reading Asimov’s Robot series, then you might imagine artificially intelligent bots capable not only 
of making predictions, but also of taking actions in the world. We want to think about intelligent 
agents, not just predictive models. This means that we need to think about choosing actions, not 
just making predictions. Moreover, unlike predictions, actions actually impact the environment. 
If we want to train an intelligent agent, we must account for the way its actions might impact the 
future observations of the agent. 


Considering the interaction with an environment opens a whole set of new modeling questions. 
The following are just a few examples. 


* Does the environment remember what we did previously? 


* Does the environment want to help us, e.g., a user reading text into a speech recognizer? 


* Does the environment want to beat us, i.e., an adversarial setting like spam filtering (against 
spammers) or playing a game (vs. an opponent)? 


* Does the environment not care? 


* Does the environment have shifting dynamics? For example, does future data always resemble 
the past or do the patterns change over time, either naturally or in response to our 
automated tools? 


This last question raises the problem of distribution shift, when training and test data are different. 
It is a problem that most of us have experienced when taking exams written by a lecturer, while the 
homework was composed by his teaching assistants. Next, we will briefly describe reinforcement 
learning, a setting that explicitly considers interactions with an environment. 


1.3.4 Reinforcement Learning 


If you are interested in using machine learning to develop an agent that interacts with an environ- 
ment and takes actions, then you are probably going to wind up focusing on reinforcement learning. 
This might include applications to robotics, to dialogue systems, and even to developing artificial 
intelligence (AI) for video games. Deep reinforcement learning, which applies deep learning to rein- 
forcement learning problems, has surged in popularity. The breakthrough deep Q-network that 
beat humans at Atari games using only the visual input, and the AlphaGo program that dethroned 
the world champion at the board game Go are two prominent examples. 


Reinforcement learning gives a very general statement of a problem, in which an agent interacts 
with an environment over a series of time steps. At each time step, the agent receives some ob- 
servation from the environment and must choose an action that is subsequently transmitted back 
to the environment via some mechanism (sometimes called an actuator). Finally, the agent re- 
ceives a reward from the environment. This process is illustrated in Fig. 1.3.7. The agent then 
receives a subsequent observation, and chooses a subsequent action, and so on. The behavior of 
a reinforcement learning agent is governed by a policy. In short, a policy is just a function that 
maps from observations of the environment to actions. The goal of reinforcement learning is to 
produce a good policy. 

Fig. 1.3.7: The interaction between reinforcement learning and an environment. 
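
The observation, action, reward cycle described above can be written as a short loop. The sketch below is a minimal illustration rather than a standard API: `CoinFlipEnvironment`, `copy_policy`, and `run_episode` are hypothetical names, and the environment is deliberately trivial.

```python
import random


class CoinFlipEnvironment:
    """Hypothetical environment: the observation is a coin face, and the reward
    is 1 if the agent's action names that face and 0 otherwise."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.face = None

    def observe(self):
        self.face = self.rng.choice(["heads", "tails"])
        return self.face

    def step(self, action):
        return 1.0 if action == self.face else 0.0


def copy_policy(observation):
    """A policy maps an observation to an action; this one simply echoes it."""
    return observation


def run_episode(env, policy, num_steps):
    """Run the agent-environment loop and return the accumulated reward."""
    total_reward = 0.0
    for _ in range(num_steps):
        observation = env.observe()       # the agent receives an observation
        action = policy(observation)      # the policy chooses an action
        total_reward += env.step(action)  # the environment returns a reward
    return total_reward


print(run_episode(CoinFlipEnvironment(), copy_policy, num_steps=10))
```

Here the optimal policy is obvious; in realistic problems the policy has to be learned from the rewards, which is the subject of reinforcement learning.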


It is hard to overstate the generality of the reinforcement learning framework. For example, we 
can cast any supervised learning problem as a reinforcement learning problem. Say we had a 
classification problem. We could create a reinforcement learning agent with one action corresponding 
to each class. We could then create an environment which gave a reward that was exactly equal 
to the negative of the loss function from the original supervised learning problem, so that 
maximizing the reward minimizes the loss. 
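
The reduction from classification to reinforcement learning can be sketched as follows. The reward is taken to be the negative of a 0/1 loss, so that a higher reward means a lower loss; the class and function names, the tiny dataset, and the two policies are assumptions made up for this illustration.

```python
def zero_one_loss(predicted_class, true_class):
    """Loss of the original supervised problem: 0 if correct, 1 otherwise."""
    return 0.0 if predicted_class == true_class else 1.0


class ClassificationEnvironment:
    """Hypothetical environment that replays a labeled dataset, one example per step."""

    def __init__(self, examples):
        self.examples = examples  # list of (feature, label) pairs
        self.position = 0

    def observe(self):
        feature, _ = self.examples[self.position]
        return feature

    def step(self, action):
        # The chosen action is interpreted as the predicted class.
        _, label = self.examples[self.position]
        self.position += 1
        return -zero_one_loss(action, label)  # reward = negative loss


def total_reward(policy, examples):
    env = ClassificationEnvironment(examples)
    return sum(env.step(policy(env.observe())) for _ in examples)


def threshold_policy(feature):
    return 1 if feature > 0.5 else 0


def always_zero_policy(feature):
    return 0


examples = [(0.2, 0), (0.9, 1), (0.4, 0), (0.7, 1)]  # made-up data
print(total_reward(threshold_policy, examples))
print(total_reward(always_zero_policy, examples))
```

The policy that classifies every example correctly collects the highest possible reward, which mirrors the zero loss it would achieve in the supervised formulation.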


That being said, reinforcement learning can also address many problems that supervised learn- 
ing cannot. For example, in supervised learning we always expect that the training input comes 
associated with the correct label. But in reinforcement learning, we do not assume that for each 
observation the environment tells us the optimal action. In general, we just get some reward. 
Moreover, the environment may not even tell us which actions led to the reward. 


Consider for example the game of chess. The only real reward signal comes at the end of the 
game when we either win, which we might assign a reward of 1, or when we lose, which we could 
assign a reward of -1. So reinforcement learners must deal with the credit assignment problem: 
determining which actions to credit or blame for an outcome. The same goes for an employee 
who gets a promotion on October 11. That promotion likely reflects a large number of well-chosen 
actions over the previous year. Getting more promotions in the future requires figuring out what 
actions along the way led to the promotion. 


Reinforcement learners may also have to deal with the problem of partial observability. That is, 
the current observation might not tell you everything about your current state. Say a cleaning 
robot found itself trapped in one of many identical closets in a house. Inferring the precise lo- 
cation (and thus state) of the robot might require considering its previous observations before 
entering the closet. 


Finally, at any given point, reinforcement learners might know of one good policy, but there might 
be many other better policies that the agent has never tried. The reinforcement learner must 
constantly choose whether to exploit the best currently-known strategy as a policy, or to explore 
the space of strategies, potentially giving up some short-run reward in exchange for knowledge. 
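
One simple way to trade off exploration against exploitation is the epsilon-greedy rule: with a small probability pick a random action, otherwise pick the action that currently looks best. The sketch below applies it to a setting with a handful of actions whose rewards are initially unknown. The epsilon-greedy rule is a common textbook strategy that the text above does not name, and the reward probabilities are made up.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon, num_steps, seed=0):
    """Simulate epsilon-greedy action selection with Bernoulli rewards.

    true_means[a] is the (hidden) probability that action a pays a reward of 1.
    """
    rng = random.Random(seed)
    num_actions = len(true_means)
    counts = [0] * num_actions       # how often each action was tried
    estimates = [0.0] * num_actions  # running average reward per action
    total_reward = 0.0
    for _ in range(num_steps):
        if rng.random() < epsilon:
            action = rng.randrange(num_actions)  # explore
        else:
            action = max(range(num_actions), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_means[action] else 0.0
        counts[action] += 1
        # Incremental update of the average reward for the chosen action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total = epsilon_greedy_bandit(
    true_means=[0.1, 0.5, 0.9], epsilon=0.1, num_steps=5000
)
print("pull counts:", counts)
print("estimated means:", [round(e, 2) for e in estimates])
```

Setting `epsilon=0` removes exploration entirely, in which case the agent can stay stuck on whichever action happened to look good first.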


The general reinforcement learning problem is a very general setting. Actions affect subsequent 
observations. Rewards are only observed corresponding to the chosen actions. The environment 
may be either fully or partially observed. Accounting for all this complexity at once may ask too 
much of researchers. Moreover, not every practical problem exhibits all this complexity. As a 
result, researchers have studied a number of special cases of reinforcement learning problems. 


When the environment is fully observed, we call the reinforcement learning problem a Markov 
decision process. When the state does not depend on the previous actions, we call the problem 
a contextual bandit problem. When there is no state, just a set of available actions with initially 
unknown rewards, this problem is the classic multi-armed bandit problem. 


1.4 Roots 


We have just reviewed a small subset of problems that machine learning can address. For a di- 
verse set of machine learning problems, deep learning provides powerful tools for solving them. 
Although many deep learning methods are recent inventions, the core idea of programming with 
data and neural networks (names of many deep learning models) has been studied for centuries. 
In fact, humans have long held the desire to analyze data and to predict future outcomes, and 
much of natural science has its roots in this. For instance, the Bernoulli distribution is named after 
Jacob Bernoulli (1655-1705)¹⁸, and the Gaussian distribution was discovered by Carl Friedrich 
Gauss (1777-1855)¹⁹. He invented, for instance, the least mean squares algorithm, which is still 
used today for countless problems from insurance calculations to medical diagnostics. These 
tools gave rise to an experimental approach in the natural sciences—for instance, Ohm’s law re- 
lating current and voltage in a resistor is perfectly described by a linear model. 
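
As a concrete instance of a linear model fitted by least squares, the sketch below estimates a resistance from current and voltage readings using Ohm's law, V = R · I. The measurements are invented for illustration, and the closed-form slope through the origin is one simple least-squares variant.

```python
def fit_resistance(currents, voltages):
    """Least-squares estimate of R in V = R * I (a line through the origin).

    Minimizing sum((v - R * i) ** 2) over R gives R = sum(i * v) / sum(i * i).
    """
    numerator = sum(i * v for i, v in zip(currents, voltages))
    denominator = sum(i * i for i in currents)
    return numerator / denominator


currents = [0.1, 0.2, 0.3, 0.4]      # amperes (made-up measurements)
voltages = [1.02, 1.97, 3.05, 3.96]  # volts, noisy readings of a roughly 10 ohm resistor

print(f"estimated resistance: {fit_resistance(currents, voltages):.2f} ohm")
```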


Even in the Middle Ages, mathematicians had a keen intuition of estimates. For instance, the 
geometry book of Jacob Köbel (1460-1533)²⁰ illustrates averaging the length of 16 adult men's feet 
to obtain the average foot length. 

Fig. 1.4.1: Estimating the length of a foot. 


Fig. 1.4.1 illustrates how this estimator works. The 16 adult men were asked to line up in a row, 
when leaving the church. Their aggregate length was then divided by 16 to obtain an estimate for 
what now amounts to 1 foot. This “algorithm” was later improved to deal with misshapen feet— 
the 2 men with the shortest and longest feet respectively were sent away, averaging only over the 
remainder. This is one of the earliest examples of the trimmed mean estimate. 
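
The estimator described above is easy to state in code. The sketch below assumes that exactly one shortest and one longest foot are discarded; the sixteen foot lengths (in centimeters) are invented for illustration.

```python
def trimmed_mean(values, num_trim=1):
    """Average after discarding the num_trim smallest and num_trim largest values."""
    if len(values) <= 2 * num_trim:
        raise ValueError("not enough values left after trimming")
    kept = sorted(values)[num_trim:len(values) - num_trim]
    return sum(kept) / len(kept)


# Made-up foot lengths in centimeters for 16 men, including two unusual feet.
foot_lengths = [26.0, 27.5, 25.5, 26.5, 27.0, 26.0, 28.0, 25.0,
                26.5, 27.0, 26.0, 25.5, 27.5, 26.5, 19.0, 34.0]

plain_mean = sum(foot_lengths) / len(foot_lengths)
print(f"plain mean:   {plain_mean:.2f} cm")
print(f"trimmed mean: {trimmed_mean(foot_lengths):.2f} cm")
```

Because the extreme values are dropped before averaging, a single misshapen foot cannot pull the estimate far in either direction.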





18 https://en.wikipedia.org/wiki/Jacob_Bernoulli 
19 https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss 
20 https://www.maa.org/press/periodicals/convergence/mathematical-treasures-jacob-kobels-geometry 


Statistics really took off with the collection and availability of data. One of its titans, Ronald Fisher 
(1890-1962)²¹, contributed significantly to its theory and also its applications in genetics. Many of 
his algorithms (such as linear discriminant analysis) and formulae (such as the Fisher information 
matrix) are still in frequent use today. In fact, even the Iris dataset that Fisher released in 1936 
is still used sometimes to illustrate machine learning algorithms. He was also a proponent of 
eugenics, which should remind us that the morally dubious use of data science has as long and 
enduring a history as its productive use in industry and the natural sciences. 


A second influence for machine learning came from information theory by Claude Shannon 
(1916-2001)²² and the theory of computation via Alan Turing (1912-1954)²³. Turing posed the 
question “can machines think?” in his famous paper Computing Machinery and Intelligence (Turing, 
1950). In what he described as the Turing test, a machine can be considered intelligent if it is 
difficult for a human evaluator to distinguish between the replies from a machine and a human 
based on textual interactions. 


Another influence can be found in neuroscience and psychology. After all, humans clearly exhibit 
intelligent behavior. It is thus only reasonable to ask whether one could explain and possibly reverse 
engineer this capacity. One of the oldest algorithms inspired in this fashion was formulated 
by Donald Hebb (1904-1985)²⁴. In his groundbreaking book The Organization of Behavior (Hebb 
& Hebb, 1949), he posited that neurons learn by positive reinforcement. This became known as 
the Hebbian learning rule. It is the prototype of Rosenblatt’s perceptron learning algorithm and it 
laid the foundations of many stochastic gradient descent algorithms that underpin deep learning 
today: reinforce desirable behavior and diminish undesirable behavior to obtain good settings of 
the parameters in a neural network. 
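
The idea of strengthening or weakening parameters in response to outcomes can be illustrated with the perceptron update mentioned above. The sketch below is a minimal version of that rule on a tiny, linearly separable problem (logical OR with labels −1 and +1); the function names and the dataset are chosen for illustration only.

```python
def predict(weights, bias, features):
    """Sign of the linear score: +1 or -1."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else -1


def train_perceptron(examples, max_epochs=100):
    """Perceptron rule: on a mistake, move the weights toward the correct label."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in examples:
            if predict(weights, bias, features) != label:
                # Reinforce the desired output for this input.
                weights = [w + label * x for w, x in zip(weights, features)]
                bias += label
                mistakes += 1
        if mistakes == 0:  # every example is classified correctly
            break
    return weights, bias


# Logical OR with labels in {-1, +1}; this dataset is linearly separable.
or_examples = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_perceptron(or_examples)
print(weights, bias)
```

On separable data such as this, the rule stops after a finite number of updates; on data that no line can separate it would keep cycling until `max_epochs` is reached.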


Biological inspiration is what gave neural networks their name. For over a century (dating back 
to the models of Alexander Bain, 1873 and Charles Sherrington, 1890), researchers have tried to 
assemble computational circuits that resemble networks of interacting neurons. Over time, the 
interpretation of biology has become less literal but the name stuck. At their heart lie a few key 
principles that can be found in most networks today: 


* The alternation of linear and nonlinear processing units, often referred to as layers. 


* The use of the chain rule (also known as backpropagation) for adjusting parameters in the 
entire network at once. 
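
Both principles can be seen in a network small enough to differentiate by hand. The sketch below alternates a linear unit, a tanh nonlinearity, and another linear unit on scalar inputs, computes the gradients with the chain rule, and compares them against finite differences. The architecture, the squared loss, and all parameter values are assumptions made for this illustration.

```python
import math


def forward(params, x):
    w1, b1, w2, b2 = params
    z = w1 * x + b1   # linear unit
    h = math.tanh(z)  # nonlinear unit
    y = w2 * h + b2   # linear unit
    return z, h, y


def loss_and_grads(params, x, target):
    """Squared loss and its gradient w.r.t. (w1, b1, w2, b2) via the chain rule."""
    w1, b1, w2, b2 = params
    z, h, y = forward(params, x)
    loss = (y - target) ** 2
    dy = 2.0 * (y - target)   # dL/dy
    dw2, db2 = dy * h, dy     # through the last linear unit
    dh = dy * w2              # dL/dh
    dz = dh * (1.0 - h * h)   # tanh'(z) = 1 - tanh(z)^2
    dw1, db1 = dz * x, dz     # through the first linear unit
    return loss, (dw1, db1, dw2, db2)


def numerical_grads(params, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for k in range(len(params)):
        plus, minus = list(params), list(params)
        plus[k] += eps
        minus[k] -= eps
        loss_plus = (forward(plus, x)[2] - target) ** 2
        loss_minus = (forward(minus, x)[2] - target) ** 2
        grads.append((loss_plus - loss_minus) / (2 * eps))
    return tuple(grads)


params = (0.5, -0.3, 1.2, 0.1)
loss, analytic = loss_and_grads(params, x=0.8, target=1.0)
numeric = numerical_grads(params, x=0.8, target=1.0)
print("analytic:", analytic)
print("numeric: ", numeric)
```

All four gradients come out of a single backward sweep that reuses the intermediate values `h` and `dy`, which is what makes adjusting every parameter at once affordable.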


After initial rapid progress, research in neural networks languished from around 1995 until 2005. 
This was mainly due to two reasons. First, training a network is computationally very expensive. 
While random-access memory was plentiful at the end of the past century, computational power 
was scarce. Second, datasets were relatively small. In fact, Fisher’s Iris dataset from 1936 was a 
popular tool for testing the efficacy of algorithms. The MNIST dataset with its 60000 handwritten 
digits was considered huge. 


Given the scarcity of data and computation, strong statistical tools such as kernel methods, deci- 
sion trees and graphical models proved empirically superior. Unlike neural networks, they did 
not require weeks to train and provided predictable results with strong theoretical guarantees. 





21 https://en.wikipedia.org/wiki/Ronald_Fisher 
22 https://en.wikipedia.org/wiki/Claude_Shannon 
23 https://en.wikipedia.org/wiki/Alan_Turing 
24 https://en.wikipedia.org/wiki/Donald_O._Hebb 


1.5 The Road to Deep Learning 


Much of this changed with the ready availability of large amounts of data, due to the World Wide 
Web, the advent of companies serving hundreds of millions of users online, a dissemination of 
cheap, high-quality sensors, cheap data storage (Kryder's law), and cheap computation (Moore's 
law), in particular in the form of GPUs, originally engineered for computer gaming. Suddenly 
algorithms and models that seemed computationally infeasible became relevant (and vice versa). 
This is best illustrated in Table 1.5.1. 


Table 1.5.1: Dataset vs. computer memory and computational power 

Decade | Dataset                              | Memory | Floating point calculations per second 
1970   | 100 (Iris)                           | 1 KB   | 100 KF (Intel 8080) 
1980   | 1 K (House prices in Boston)         | 100 KB | 1 MF (Intel 80186) 
1990   | 10 K (optical character recognition) | 10 MB  | 10 MF (Intel 80486) 
2000   | 10 M (web pages)                     | 100 MB | 1 GF (Intel Core) 
2010   | 10 G (advertising)                   | 1 GB   | 1 TF (Nvidia C2050) 
2020   | 1 T (social network)                 | 100 GB | 1 PF (Nvidia DGX-2) 


It is evident that random-access memory has not kept pace with the growth in data. At the same 
time, the increase in computational power has outpaced that of the data available. This means that 
statistical models need to become more memory efficient (this is typically achieved by adding non- 
linearities) while simultaneously being able to spend more time on optimizing these parameters, 
due to an increased computational budget. Consequently, the sweet spot in machine learning and 
statistics moved from (generalized) linear models and kernel methods to deep neural networks. 
This is also one of the reasons why many of the mainstays of deep learning, such as multilayer 
perceptrons (McCulloch & Pitts, 1943), convolutional neural networks (LeCun et al., 1998), long 
short-term memory (Hochreiter & Schmidhuber, 1997), and Q-Learning (Watkins & Dayan, 1992), 
were essentially “rediscovered” in the past decade, after laying comparatively dormant for con- 
siderable time. 


The recent progress in statistical models, applications, and algorithms has sometimes been 
likened to the Cambrian explosion: a moment of rapid progress in the evolution of species. Indeed, 
the state of the art is not just a mere consequence of available resources, applied to decades-old 
algorithms. Note that the list below barely scratches the surface of the ideas that have helped 
researchers achieve tremendous progress over the past decade. 


* Novel methods for capacity control, such as dropout (Srivastava et al., 2014), have helped to 
mitigate the danger of overfitting. This was achieved by applying noise injection (Bishop, 
1995) throughout the neural network, replacing weights by random variables for training 
purposes. 


* Attention mechanisms solved a second problem that had plagued statistics for over a century: 
how to increase the memory and complexity of a system without increasing the number 
of learnable parameters. Researchers found an elegant solution by using what can only 
be viewed as a learnable pointer structure (Bahdanau et al., 2014). Rather than having to 
remember an entire text sequence, e.g., for machine translation in a fixed-dimensional rep- 
resentation, all that needed to be stored was a pointer to the intermediate state of the trans- 
lation process. This allowed for significantly increased accuracy for long sequences, since 
the model no longer needed to remember the entire sequence before commencing the gen- 
eration of a new sequence. 


* Multi-stage designs, e.g., via the memory networks (Sukhbaatar et al., 2015) and the neural 
programmer-interpreter (Reed & DeFreitas, 2015) allowed statistical modelers to describe 
iterative approaches to reasoning. These tools allow for an internal state of the deep neural 
network to be modified repeatedly, thus carrying out subsequent steps in a chain of reason- 
ing, similar to how a processor can modify memory for a computation. 


* Another key development was the invention of generative adversarial networks (Goodfellow 
et al., 2014). Traditionally, statistical methods for density estimation and generative models 
focused on finding proper probability distributions and (often approximate) algorithms for 
sampling from them. As a result, these algorithms were largely limited by the lack of flexibility 
inherent in the statistical models. The crucial innovation in generative adversarial 
networks was to replace the sampler by an arbitrary algorithm with differentiable parameters. 
These are then adjusted in such a way that the discriminator (effectively a two-sample 
test) cannot distinguish fake from real data. Through the ability to use arbitrary algorithms 
to generate data, it opened up density estimation to a wide variety of techniques. Examples 
of galloping zebras (Zhu et al., 2017) and of fake celebrity faces (Karras et al., 2017) are both 
testimony to this progress. Even amateur doodlers can produce photorealistic images based 
on just sketches that describe what the layout of a scene looks like (Park et al., 2019). 


* In many cases, a single GPU is insufficient to process the large amounts of data available 
for training. Over the past decade the ability to build parallel and distributed training algorithms 
has improved significantly. One of the key challenges in designing scalable algorithms 
is that the workhorse of deep learning optimization, stochastic gradient descent, relies 
on relatively small minibatches of data to be processed. At the same time, small batches 
limit the efficiency of GPUs. Hence, training on 1024 GPUs with a minibatch size of, say, 32 
images per GPU amounts to an aggregate minibatch of about 32000 images (1024 × 32 = 32768). 
Recent work, first by Li (Li, 2017), and subsequently by You et al. (2017) and Jia et al. (2018), 
pushed the size up to 64000 observations, reducing training time for the ResNet-50 model on the 
ImageNet dataset to less than 7 minutes. For comparison, training times were initially measured 
in the order of days. 


* The ability to parallelize computation has also contributed quite crucially to progress in reinforcement 
learning, at least whenever simulation is an option. This has led to significant 
progress in computers achieving superhuman performance in Go, Atari games, StarCraft, 
and in physics simulations (e.g., using MuJoCo). See, e.g., (Silver et al., 2016) for a description 
of how to achieve this in AlphaGo. In a nutshell, reinforcement learning works best if 
plenty of (state, action, reward) triples are available, i.e., whenever it is possible to try out 
lots of things to learn how they relate to each other. Simulation provides such an avenue. 


* Deep learning frameworks have played a crucial role in disseminating ideas. The first 
generation of frameworks allowing for easy modeling encompassed Caffe²⁵, Torch²⁶, and 
Theano²⁷. Many seminal papers were written using these tools. By now, they have been superseded 
by TensorFlow²⁸ (often used via its high level API Keras²⁹), CNTK³⁰, Caffe 2³¹, and 
Apache MXNet³². The third generation of tools, namely imperative tools for deep learning, 
was arguably spearheaded by Chainer³³, which used a syntax similar to Python NumPy to 




25 https://github.com/BVLC/caffe 
26 https://github.com/torch 
27 https://github.com/Theano/Theano 
28 https://github.com/tensorflow/tensorflow 
29 https://github.com/keras-team/keras 
30 https://github.com/Microsoft/CNTK 
31 https://github.com/caffe2/caffe2 
32 https://github.com/apache/incubator-mxnet 
33 https://github.com/chainer/chainer 


describe models. This idea was adopted by PyTorch³⁴, the Gluon API³⁵ of MXNet, and 
JAX³⁶. 
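
To make the noise-injection idea behind dropout (the first item in the list above) concrete, here is a minimal sketch of the commonly used "inverted" variant applied to a list of activations. It is an illustration under stated assumptions rather than the exact formulation of the cited paper: each value is zeroed with probability `drop_prob`, and the survivors are rescaled so that the expected value is unchanged.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob and rescale the rest.

    Dividing the kept values by (1 - drop_prob) keeps the expected value of
    every activation equal to its original value.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
noisy = dropout([1.0] * 10_000, drop_prob=0.5, rng=rng)
print("fraction zeroed:", noisy.count(0.0) / len(noisy))
print("mean after dropout:", sum(noisy) / len(noisy))
```

The noise is applied only during training; at prediction time the activations are typically used unchanged.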


The division of labor between system researchers building better tools and statistical modelers 
building better neural networks has greatly simplified things. For instance, training a linear logistic 
regression model used to be a nontrivial homework problem, worthy of giving to new machine 
learning Ph.D. students at Carnegie Mellon University in 2014. By now, this task can be accomplished 
with fewer than 10 lines of code, putting it firmly into the grasp of programmers. 
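
For a sense of what such a task involves when written out by hand, the sketch below trains a one-feature logistic regression model with plain gradient descent, using only the Python standard library. The dataset, learning rate, and number of epochs are arbitrary choices for illustration; a deep learning framework would hide the gradient computation and the update loop behind a few library calls.

```python
import math


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def train_logistic_regression(xs, ys, learning_rate=0.5, epochs=500):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # derivative of the log loss w.r.t. the score
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / len(xs)
        b -= learning_rate * grad_b / len(xs)
    return w, b


xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]  # made-up feature values
ys = [0, 0, 0, 1, 1, 1]                 # made-up binary labels

w, b = train_logistic_regression(xs, ys)
for x, y in zip(xs, ys):
    print(f"x={x:+.1f}  label={y}  predicted probability={sigmoid(w * x + b):.3f}")
```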


1.6 Success Stories 


AI has a long history of delivering results that would be difficult to accomplish otherwise. For instance, 
mail sorting systems using optical character recognition have been deployed since the 
1990s. This is, after all, the source of the famous MNIST dataset of handwritten digits. The same 
applies to reading checks for bank deposits and scoring creditworthiness of applicants. Financial 
transactions are checked for fraud automatically. This forms the backbone of many e-commerce 
payment systems, such as PayPal, Stripe, AliPay, WeChat, Apple, Visa, and MasterCard. Computer 
programs for chess have been competitive for decades. Machine learning feeds search, recom- 
mendation, personalization, and ranking on the Internet. In other words, machine learning is 
pervasive, albeit often hidden from sight. 


It is only recently that AI has been in the limelight, mostly due to solutions to problems that were 
considered intractable previously and that are directly related to consumers. Many such advances 
are attributed to deep learning. 


* Intelligent assistants, such as Apple’s Siri, Amazon's Alexa, and Google's assistant, are able to 
answer spoken questions with a reasonable degree of accuracy. This ranges from menial tasks 
such as turning on light switches (a boon to the disabled) to making barber's appointments 
and offering phone support dialog. This is likely the most noticeable sign that AI is 
affecting our lives. 


* A key ingredient in digital assistants is the ability to recognize speech accurately. Gradually 
the accuracy of such systems has increased to the point where they reach human parity for 
certain applications (Xiong et al., 2018). 


* Object recognition likewise has come a long way. Estimating the object in a picture was a 
fairly challenging task in 2010. On the ImageNet benchmark researchers from NEC Labs 
and University of Illinois at Urbana-Champaign achieved a top-5 error rate of 28% (Lin et 
al., 2010). By 2017, this error rate was reduced to 2.25% (Hu et al., 2018). Similarly, stunning 
results have been achieved for identifying birds or diagnosing skin cancer. 


* Games used to be a bastion of human intelligence. Starting from TD-Gammon, a program 
for playing backgammon using temporal difference reinforcement learning, algorithmic 
and computational progress has led to algorithms for a wide range of applications. Unlike 
backgammon, chess has a much more complex state space and set of actions. DeepBlue beat 
Garry Kasparov using massive parallelism, special-purpose hardware and efficient search 
through the game tree (Campbell et al., 2002). Go is more difficult still, due to its huge state 
space. AlphaGo reached human parity in 2015, using deep learning combined with Monte 
Carlo tree sampling (Silver et al., 2016). The challenge in Poker was that the state space is 





* https://github.com/pytorch/pytorch 
3 https://github.com/apache/incubator-mxnet 
3 https://github.com/google/jax 
large and it is not fully observed (we do not know the opponents' cards). Libratus exceeded
human performance in Poker using efficiently structured strategies (Brown & Sandholm,
2017). This illustrates the impressive progress in games and the fact that advanced algorithms
played a crucial part in them.


* Another indication of progress in AI is the advent of self-driving cars and trucks. While
full autonomy is not quite within reach yet, excellent progress has been made in this direction,
with companies such as Tesla, NVIDIA, and Waymo shipping products that enable at
least partial autonomy. What makes full autonomy so challenging is that proper driving requires
the ability to perceive, to reason and to incorporate rules into a system. At present,
deep learning is used primarily in the computer vision aspect of these problems. The rest is
heavily tuned by engineers.


Again, the above list barely scratches the surface of where machine learning has impacted practical
applications. For instance, robotics, logistics, computational biology, particle physics, and
astronomy owe some of their most impressive recent advances at least in part to machine learning.
Machine learning is thus becoming a ubiquitous tool for engineers and scientists.


Frequently, the question of the AI apocalypse, or the AI singularity has been raised in non- 
technical articles on AI. The fear is that somehow machine learning systems will become sentient 
and decide independently from their programmers (and masters) about things that directly af- 
fect the livelihood of humans. To some extent, AI already affects the livelihood of humans in an 
immediate way: creditworthiness is assessed automatically, autopilots mostly navigate vehicles, 
decisions about whether to grant bail use statistical data as input. More frivolously, we can ask 
Alexa to switch on the coffee machine. 


Fortunately, we are far from a sentient AI system that is ready to manipulate its human creators 
(or burn their coffee). First, AI systems are engineered, trained and deployed in a specific, goal- 
oriented manner. While their behavior might give the illusion of general intelligence, it is a combination
of rules, heuristics and statistical models that underlie the design. Second, at present
there simply are no tools for artificial general intelligence that are able to improve themselves,
reason about themselves, and modify, extend, and improve their own architecture
while trying to solve general tasks.


A much more pressing concern is how AI is being used in our daily lives. It is likely that many
menial tasks fulfilled by truck drivers and shop assistants can and will be automated. Farm robots 
will likely reduce the cost for organic farming but they will also automate harvesting operations. 
This phase of the industrial revolution may have profound consequences on large swaths of soci- 
ety, since truck drivers and shop assistants are some of the most common jobs in many countries. 
Furthermore, statistical models, when applied without care, can lead to racial, gender, or age bias
and raise reasonable concerns about procedural fairness if automated to drive consequential decisions.
It is important to ensure that these algorithms are used with care. With what we know
today, this strikes us as a much more pressing concern than the potential of malevolent superintelligence
to destroy humanity.


1.7 Characteristics


Thus far, we have talked about machine learning broadly, which is both a branch of AI and an approach
to AI. Though deep learning is a subset of machine learning, the dizzying set of algorithms
and applications makes it difficult to assess what specifically the ingredients for deep learning 
might be. This is as difficult as trying to pin down required ingredients for pizza since almost 
every component is substitutable. 


As we have described, machine learning can use data to learn transformations between inputs 
and outputs, such as transforming audio into text in speech recognition. In doing so, it is often 
necessary to represent data in a way suitable for algorithms to transform such representations 
into the output. Deep learning is deep in precisely the sense that its models learn many layers of 
transformations, where each layer offers the representation at one level. For example, layers near 
the input may represent low-level details of the data, while layers closer to the classification output 
may represent more abstract concepts used for discrimination. Since representation learning aims 
at finding the representation itself, deep learning can be referred to as multi-level representation 
learning. 


The problems that we have discussed so far, such as learning from the raw audio signal, the raw 
pixel values of images, or mapping between sentences of arbitrary lengths and their counterparts 
in foreign languages, are those where deep learning excels and where traditional machine learning
methods falter. It turns out that these many-layered models are capable of addressing low-level
perceptual data in a way that previous tools could not. Arguably the most significant commonality
in deep learning methods is the use of end-to-end training. That is, rather than assembling
a system based on components that are individually tuned, one builds the system and then tunes
its performance jointly. For instance, in computer vision scientists used to separate the process
of feature engineering from the process of building machine learning models. The Canny edge de- 
tector (Canny, 1987) and Lowe's SIFT feature extractor (Lowe, 2004) reigned supreme for over a 
decade as algorithms for mapping images into feature vectors. In bygone days, the crucial part of 
applying machine learning to these problems consisted of coming up with manually-engineered 
ways of transforming the data into some form amenable to shallow models. Unfortunately, there
is only so much that humans can accomplish by ingenuity in comparison with a consistent evaluation
over millions of choices carried out automatically by an algorithm. When deep learning
took over, these feature extractors were replaced by automatically tuned filters, yielding superior 
accuracy. 


Thus, one key advantage of deep learning is that it replaces not only the shallow models at the 
end of traditional learning pipelines, but also the labor-intensive process of feature engineering. 
Moreover, by replacing much of the domain-specific preprocessing, deep learning has eliminated 
many of the boundaries that previously separated computer vision, speech recognition, natural 
language processing, medical informatics, and other application areas, offering a unified set of 
tools for tackling diverse problems. 


Beyond end-to-end training, we are experiencing a transition from parametric statistical descrip- 
tions to fully nonparametric models. When data are scarce, one needs to rely on simplifying as- 
sumptions about reality in order to obtain useful models. When data are abundant, this can be 
replaced by nonparametric models that fit reality more accurately. To some extent, this mirrors 
the progress that physics experienced in the middle of the previous century with the availability 
of computers. Rather than solving parametric approximations of how electrons behave by hand, 
one can now resort to numerical simulations of the associated partial differential equations. This 
has led to much more accurate models, albeit often at the expense of explainability. 


Another difference from previous work is the acceptance of suboptimal solutions, dealing with nonconvex
nonlinear optimization problems, and the willingness to try things before proving them.
This newfound empiricism in dealing with statistical problems, combined with a rapid influx of
talent, has led to rapid progress of practical algorithms, albeit in many cases at the expense of
modifying and re-inventing tools that existed for decades.


In the end, the deep learning community prides itself on sharing tools across academic and corporate
boundaries, releasing many excellent libraries, statistical models, and trained networks as
open source. It is in this spirit that the notebooks forming this book are freely available for distribution
and use. We have worked hard to lower the barriers of access for everyone to learn about
deep learning and we hope that our readers will benefit from this.


Summary 


* Machine learning studies how computer systems can leverage experience (often data) to
improve performance at specific tasks. It combines ideas from statistics, data mining, and
optimization. Often, it is used as a means of implementing AI solutions.


* As a class of machine learning, representation learning focuses on how to automatically
find the appropriate way to represent data. Deep learning is multi-level representation
learning through learning many layers of transformations.


* Deep learning replaces not only the shallow models at the end of traditional machine learning
pipelines, but also the labor-intensive process of feature engineering.


* Much of the recent progress in deep learning has been triggered by an abundance of data
arising from cheap sensors and Internet-scale applications, and by significant progress in
computation, mostly through GPUs.


* Whole system optimization is a key component in obtaining high performance. The availability
of efficient deep learning frameworks has made design and implementation of this
significantly easier.


Exercises 


1. Which parts of code that you are currently writing could be “learned”, i.e., improved by
learning and automatically determining design choices that are made in your code? Does
your code include heuristic design choices?


2. Which problems that you encounter have many examples for how to solve them, yet no specific
way to automate them? These may be prime candidates for using deep learning.


3. Viewing the development of AI as a new industrial revolution, what is the relationship between
algorithms and data? Is it similar to steam engines and coal? What is the fundamental
difference?


4. Where else can you apply the end-to-end training approach, such as in Fig. 1.1.2, physics,
engineering, and econometrics?


Discussions





https://discuss.d2l.ai/t/22


2 Preliminaries 


To get started with deep learning, we will need to develop a few basic skills. All machine learning 
is concerned with extracting information from data. So we will begin by learning the practical 
skills for storing, manipulating, and preprocessing data. 


Moreover, machine learning typically requires working with large datasets, which we can think 
of as tables, where the rows correspond to examples and the columns correspond to attributes. 
Linear algebra gives us a powerful set of techniques for working with tabular data. We will not go
too far into the weeds but rather focus on the basics of matrix operations and their implementation.


Additionally, deep learning is all about optimization. We have a model with some parameters and 
we want to find those that fit our data the best. Determining which way to move each parameter at 
each step of an algorithm requires a little bit of calculus, which will be briefly introduced. Fortunately,
the autograd package automatically computes derivatives for us, and we will cover it
next.


Next, machine learning is concerned with making predictions: what is the likely value of some un- 
known attribute, given the information that we observe? To reason rigorously under uncertainty 
we will need to invoke the language of probability. 


Finally, the official documentation provides plenty of descriptions and examples that go beyond
this book. To conclude the chapter, we will show you how to look up documentation for the
needed information.


This book has kept the mathematical content to the minimum necessary to get a proper understanding
of deep learning. However, this does not mean that the book is mathematics free. Thus,
this chapter provides a rapid introduction to basic and frequently-used mathematics to allow anyone
to understand at least most of the mathematical content of the book. If you wish to understand
all of the mathematical content, further reviewing the online appendix on mathematics should
be sufficient.


2.1 Data Manipulation 


In order to get anything done, we need some way to store and manipulate data. Generally, there 
are two important things we need to do with data: (i) acquire them; and (ii) process them once they 
are inside the computer. There is no point in acquiring data without some way to store it, so let us 
get our hands dirty first by playing with synthetic data. To start, we introduce the n-dimensional 
array, which is also called the tensor. 


If you have worked with NumPy, the most widely-used scientific computing package in Python, 
then you will find this section familiar. No matter which framework you use, its tensor class 
(ndarray in MXNet, Tensor in both PyTorch and TensorFlow) is similar to NumPy’s ndarray with
a few killer features. First, GPUs are well supported for accelerating computation, whereas NumPy
only supports CPU computation. Second, the tensor class supports automatic differentiation.
These properties make the tensor class suitable for deep learning. Throughout the book, when 
we say tensors, we are referring to instances of the tensor class unless otherwise stated. 


2.1.1 Getting Started 


In this section, we aim to get you up and running, equipping you with the basic math and numer- 
ical computing tools that you will build on as you progress through the book. Do not worry if you 
struggle to grok some of the mathematical concepts or library functions. The following sections 
will revisit this material in the context of practical examples and it will sink in. On the other hand,
if you already have some background and want to go deeper into the mathematical content, just 
skip this section. 


To start, we import the np (numpy) and npx (numpy_extension) modules from MXNet. Here, the np 
module includes functions supported by NumPy, while the npx module contains a set of extensions 
developed to empower deep learning within a NumPy-like environment. When using tensors, we 
almost always invoke the set_np function: this is for compatibility of tensor processing by other 
components of MXNet. 


from mxnet import np, npx 
npx.set_np() 


A tensor represents a (possibly multi-dimensional) array of numerical values. With one axis, a 
tensor corresponds (in math) to a vector. With two axes, a tensor corresponds to a matrix. Tensors 
with more than two axes do not have special mathematical names. 


To start, we can use arange to create a row vector x containing the first 12 integers starting with 0, 
though they are created as floats by default. Each of the values in a tensor is called an element of 
the tensor. For instance, there are 12 elements in the tensor x. Unless otherwise specified, a new 
tensor will be stored in main memory and designated for CPU-based computation. 


x = np.arange(12) 
x 


array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.])


We can access a tensor’s shape (the length along each axis) by inspecting its shape property. 


x.shape


(12,) 


If we just want to know the total number of elements in a tensor, i.e., the product of all of the shape 
elements, we can inspect its size. Because we are dealing with a vector here, the single element 
of its shape is identical to its size. 


x.size


12


To change the shape of a tensor without altering either the number of elements or their values, we 
can invoke the reshape function. For example, we can transform our tensor, x, from a row vector 
with shape (12,) to a matrix with shape (3, 4). This new tensor contains the exact same values, but 
views them as a matrix organized as 3 rows and 4 columns. To reiterate, although the shape has 
changed, the elements have not. Note that the size is unaltered by reshaping. 


X = x.reshape(3, 4) 
X 


array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])


Reshaping by manually specifying every dimension is unnecessary. If our target shape is a ma- 
trix with shape (height, width), then after we know the width, the height is given implicitly. Why 
should we have to perform the division ourselves? In the example above, to get a matrix with 3 
rows, we specified both that it should have 3 rows and 4 columns. Fortunately, tensors can au- 
tomatically work out one dimension given the rest. We invoke this capability by placing -1 for 
the dimension that we would like tensors to automatically infer. In our case, instead of calling 
x.reshape(3, 4), we could have equivalently called x.reshape(-1, 4) or x.reshape(3, -1). 
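

The shape inference described above can be checked with a minimal sketch. It assumes plain NumPy as a stand-in for MXNet's np module (an assumption made so the snippet runs without MXNet); note that plain NumPy's arange yields integers rather than floats.

```python
import numpy as np  # assumption: plain NumPy standing in for mxnet.np

x = np.arange(12)

A = x.reshape(3, 4)    # both dimensions given explicitly
B = x.reshape(-1, 4)   # number of rows inferred: 12 / 4 = 3
C = x.reshape(3, -1)   # number of columns inferred: 12 / 3 = 4

# All three calls produce the same shape and the same values.
print(A.shape, B.shape, C.shape)
print(x.size == A.size)  # reshaping leaves the number of elements unchanged
```

At most one dimension can be given as -1, since the remaining dimensions must determine it uniquely.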


Typically, we will want our matrices initialized either with zeros, ones, some other constants, or
numbers randomly sampled from a specific distribution. We can create a tensor with all
elements set to 0 and a shape of (2, 3, 4) as follows:


np.zeros((2, 3, 4)) 


array([[[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]])


Similarly, we can create tensors with each element set to 1 as follows: 


np.ones((2, 3, 4)) 


array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])


Often, we want to randomly sample the values for each element in a tensor from some probability 
distribution. For example, when we construct arrays to serve as parameters in a neural network, 
we will typically initialize their values randomly. The following snippet creates a tensor with shape
(3, 4). Each of its elements is randomly sampled from a standard Gaussian (normal) distribution 
with a mean of 0 and a standard deviation of 1. 


np.random.normal(0, 1, size=(3, 4))


array([[ 2.2122064 ,  1.1630787 ,  0.7740038 ,  0.4838046 ],
       [ 1.0434405 ,  0.29956347,  1.1839255 ,  0.15302546],
       [ 1.8917114 , -1.1688148 , -1.2347414 ,  1.5580711 ]])


We can also specify the exact values for each element in the desired tensor by supplying a Python 
list (or list of lists) containing the numerical values. Here, the outermost list corresponds to axis 
0, and the inner list to axis 1. 


np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])


array([[2., 1., 4., 3.],
       [1., 2., 3., 4.],
       [4., 3., 2., 1.]])


2.1.2 Operations 


This book is not about software engineering. Our interests are not limited to simply reading and 
writing data from/to arrays. We want to perform mathematical operations on those arrays. Some 
of the simplest and most useful operations are the elementwise operations. These apply a stan- 
dard scalar operation to each element of an array. For functions that take two arrays as inputs, 
elementwise operations apply some standard binary operator on each pair of corresponding ele- 
ments from the two arrays. We can create an elementwise function from any function that maps 
from a scalar to a scalar. 


In mathematical notation, we would denote such a unary scalar operator (taking one input) by the
signature $f: \mathbb{R} \rightarrow \mathbb{R}$. This just means that the function is mapping from any real number ($\mathbb{R}$) onto
another. Likewise, we denote a binary scalar operator (taking two real inputs, and yielding one
output) by the signature $f: \mathbb{R}, \mathbb{R} \rightarrow \mathbb{R}$. Given any two vectors $\mathbf{u}$ and $\mathbf{v}$ of the same shape, and a binary
operator $f$, we can produce a vector $\mathbf{c} = F(\mathbf{u}, \mathbf{v})$ by setting $c_i \gets f(u_i, v_i)$ for all $i$, where $c_i$, $u_i$, and
$v_i$ are the $i^\mathrm{th}$ elements of vectors $\mathbf{c}$, $\mathbf{u}$, and $\mathbf{v}$. Here, we produced the vector-valued $F: \mathbb{R}^d, \mathbb{R}^d \rightarrow \mathbb{R}^d$
by lifting the scalar function to an elementwise vector operation.
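

To make the idea of lifting concrete, here is a minimal sketch. The helper name lift and the function scalar_add are hypothetical names introduced only for illustration, and plain NumPy is assumed in place of MXNet's np module. In practice the built-in elementwise operators are used instead of an explicit Python loop.

```python
import numpy as np  # assumption: plain NumPy standing in for mxnet.np


def lift(f):
    """Turn a binary scalar function f into an elementwise vector function F.

    F(u, v)[i] = f(u[i], v[i]) for two vectors u and v of the same shape.
    (Hypothetical helper, for illustration only.)
    """
    def F(u, v):
        u = np.asarray(u, dtype=float)
        v = np.asarray(v, dtype=float)
        if u.shape != v.shape:
            raise ValueError("u and v must have the same shape")
        # Apply the scalar function to each pair of corresponding elements.
        return np.array([f(ui, vi) for ui, vi in zip(u, v)])
    return F


def scalar_add(a, b):
    # An ordinary scalar function f: R, R -> R.
    return a + b


F = lift(scalar_add)
u = np.array([1.0, 2.0, 4.0, 8.0])
v = np.array([2.0, 2.0, 2.0, 2.0])
c = F(u, v)
print(c)
```

The lifted function agrees with the built-in elementwise operator, i.e., F(u, v) matches u + v.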


The common standard arithmetic operators (+, -, *, /, and **) have all been lifted to elementwise
operations for any identically-shaped tensors of arbitrary shape. We can call elementwise
operations on any two tensors of the same shape. In the following example, we use commas to
formulate a 5-element tuple, where each element is the result of an elementwise operation.


x = np.array([1, 2, 4, 8])
y = np.array([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y  # The ** operator is exponentiation


(array([ 3.,  4.,  6., 10.]),
 array([-1.,  0.,  2.,  6.]),
 array([ 2.,  4.,  8., 16.]),
 array([0.5, 1. , 2. , 4. ]),
 array([ 1.,  4., 16., 64.]))


Many more operations can be applied elementwise, including unary operators like exponentiation.


np.exp(x) 
array([2.7182817e+00, 7.3890562e+00, 5.4598148e+01, 2.9809580e+03]) 


In addition to elementwise computations, we can also perform linear algebra operations, includ- 
ing vector dot products and matrix multiplication. We will explain the crucial bits of linear algebra 
(with no assumed prior knowledge) in Section 2.3. 


We can also concatenate multiple tensors together, stacking them end-to-end to form a larger ten- 
sor. We just need to provide a list of tensors and tell the system along which axis to concatenate. 
The example below shows what happens when we concatenate two matrices along rows (axis 0, 
the first element of the shape) vs. columns (axis 1, the second element of the shape). We can see 
that the first output tensor's axis-0 length (6) is the sum of the two input tensors’ axis-0 lengths 
(3 + 3); while the second output tensor’s axis-1 length (8) is the sum of the two input tensors’ axis-1
lengths (4 + 4). 


X = np.arange(12).reshape(3, 4) 
Y = np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
np.concatenate([X, Y], axis=0), np.concatenate([X, Y], axis=1)


(array([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [ 2.,  1.,  4.,  3.],
        [ 1.,  2.,  3.,  4.],
        [ 4.,  3.,  2.,  1.]]),
 array([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
        [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
        [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]]))

Sometimes, we want to construct a binary tensor via logical statements. Take X == Y as an example. 
For each position, if X and Y are equal at that position, the corresponding entry in the new tensor
takes a value of True (1), meaning that the logical statement X == Y is true at that position; otherwise that
position takes False (0).


X == Y


array([[False, True, False, True], 
[False, False, False, False], 
[False, False, False, False]]) 


Summing all the elements in the tensor yields a tensor with only one element.


X.sum()


array(66.) 


2.1.3 Broadcasting Mechanism 


In the above section, we saw how to perform elementwise operations on two tensors of the same 
shape. Under certain conditions, even when shapes differ, we can still perform elementwise op- 
erations by invoking the broadcasting mechanism. This mechanism works in the following way: 
First, expand one or both arrays by copying elements appropriately so that after this transforma- 
tion, the two tensors have the same shape. Second, carry out the elementwise operations on the 
resulting arrays. 


In most cases, we broadcast along an axis where an array initially only has length 1, such as in the 
following example: 


a = np.arange(3).reshape(3, 1) 
b = np.arange(2).reshape(1, 2) 
a, b 


(array([[0.],
        [1.],
        [2.]]),
 array([[0., 1.]]))


Since a and b are 3 x 1 and 1 x 2 matrices respectively, their shapes do not match up if we want 
to add them. We broadcast the entries of both matrices into a larger 3 x 2 matrix as follows: for 
matrix a it replicates the columns and for matrix b it replicates the rows before adding up both 
elementwise. 


a + b


array([[0., 1.],
       [1., 2.],
       [2., 3.]])
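

The two steps of broadcasting (expand, then operate elementwise) can be spelled out explicitly. The sketch below assumes plain NumPy in place of MXNet's np module and uses NumPy's broadcast_to function to perform the expansion by hand; the variable names are illustrative.

```python
import numpy as np  # assumption: plain NumPy standing in for mxnet.np

a = np.arange(3).reshape(3, 1)   # shape (3, 1)
b = np.arange(2).reshape(1, 2)   # shape (1, 2)

# Step 1: expand both arrays to the common shape (3, 2).
a_big = np.broadcast_to(a, (3, 2))   # the single column of a is replicated
b_big = np.broadcast_to(b, (3, 2))   # the single row of b is replicated

# Step 2: ordinary elementwise addition on same-shaped arrays.
explicit = a_big + b_big

# Broadcasting performs both steps implicitly.
implicit = a + b
print(np.array_equal(explicit, implicit))
```

Shapes that cannot be aligned this way, such as (3, 2) and (3, 4), cause the operation to raise an error instead.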


2.1.4 Indexing and Slicing 


Just as in any other Python array, elements in a tensor can be accessed by index. As in any Python
array, the first element has index 0 and ranges are specified to include the first element but exclude the last.
As in standard Python lists, we can access elements according to their relative position
to the end of the list by using negative indices.


Thus, [-1] selects the last element and [1:3] selects the second and the third elements as follows:


X[-1], X[1:3]


(array([ 8.,  9., 10., 11.]),
 array([[ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]]))


Beyond reading, we can also write elements of a matrix by specifying indices. 


X[1, 2] = 9 

X 

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  9.,  7.],
       [ 8.,  9., 10., 11.]])


If we want to assign multiple elements the same value, we simply index all of them and then assign 
them the value. For instance, [0:2, :] accesses the first and second rows, where : takes all the 
elements along axis 1 (column). While we discussed indexing for matrices, this obviously also 
works for vectors and for tensors of more than 2 dimensions. 


X[0:2, :] = 12 
X 


array([[12., 12., 12., 12.],
       [12., 12., 12., 12.],
       [ 8.,  9., 10., 11.]])
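

The same indexing rules carry over to vectors and to tensors with more than two axes. The following sketch assumes plain NumPy in place of MXNet's np module; the variable names are illustrative. Writes go to a copy so that the values read earlier stay untouched.

```python
import numpy as np  # assumption: plain NumPy standing in for mxnet.np

v = np.arange(5)                     # vector [0, 1, 2, 3, 4]
T = np.arange(24).reshape(2, 3, 4)   # tensor with three axes

last = v[-1]          # last element of the vector
middle = v[1:3]       # second and third elements
block = T[-1]         # last 3 x 4 matrix along axis 0
sub = T[:, 1:3, :]    # rows 1 and 2 of every matrix
col = T[0, :, -1]     # last column of the first matrix

W = T.copy()          # copy, so that T itself is not modified
W[0, 0:2, :] = -1     # assign one value to a whole slice

print(last, middle, block.shape, sub.shape, col)
```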


2.1.5 Saving Memory 


Running operations can cause new memory to be allocated to host results. For example, if we 
write Y = X + Y, we will dereference the tensor that Y used to point to and instead point Y at 
the newly allocated memory. In the following example, we demonstrate this with Python’s id() 
function, which gives us the exact address of the referenced object in memory. After running Y = 
Y + X, we will find that id(Y) points to a different location. That is because Python first evaluates Y 
+ X, allocating new memory for the result and then makes Y point to this new location in memory. 


before = id(Y) 
Y = Y + X
id(Y) == before 


False 


This might be undesirable for two reasons. First, we do not want to run around allocating mem- 
ory unnecessarily all the time. In machine learning, we might have hundreds of megabytes of 
parameters and update all of them multiple times per second. Typically, we will want to perform 
these updates in place. Second, we might point at the same parameters from multiple variables. 
If we do not update in place, other references will still point to the old memory location, making 
it possible for parts of our code to inadvertently reference stale parameters. 


Fortunately, performing in-place operations is easy. We can assign the result of an operation to 
a previously allocated array with slice notation, e.g., Y[:] = <expression>. To illustrate this 
concept, we first create a new matrix Z with the same shape as Y, using zeros_like to
allocate a block of 0 entries. 


Z = np.zeros_like(Y)
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))


id(Z): 140356853688512 
id(Z): 140356853688512 


If the value of X is not reused in subsequent computations, we can also use X[:] = X + Y or X += Y
to reduce the memory overhead of the operation.


before = id(X) 


X += Y 
id(X) == before 


True 
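

The stale-reference problem mentioned above can be reproduced in a few lines. This is a minimal sketch assuming plain NumPy in place of MXNet's np module; the names w, alias, w2, and alias2 are hypothetical and stand for a parameter array and a second variable that refers to it.

```python
import numpy as np  # assumption: plain NumPy standing in for mxnet.np

# Rebinding: the update allocates a new array, so the alias goes stale.
w = np.zeros(3)
alias = w
w = w + 1                      # w now refers to newly allocated memory
stale = alias.tolist()         # alias still sees the old values

# In-place update: the alias observes the new values.
w2 = np.zeros(3)
alias2 = w2
w2 += 1                        # writes into the existing memory
fresh = alias2.tolist()

# Slice assignment also writes into existing memory.
before = id(w2)
w2[:] = w2 + 1
same_object = id(w2) == before

print(stale, fresh, same_object)
```

Note that w2[:] = w2 + 1 still allocates a temporary array for the right-hand side before copying it into w2; only the identity of w2 is preserved.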


2.1.6 Conversion to Other Python Objects 


Converting to a NumPy tensor, or vice versa, is easy. The converted result does not share memory. 
This minor inconvenience is actually quite important: when you perform operations on the CPU 
or on GPUs, you do not want to halt computation, waiting to see whether the NumPy package of 
Python might want to be doing something else with the same chunk of memory. 


A = X.asnumpy() 


B = np.array(A) 
type(A), type(B) 


(numpy.ndarray, mxnet.numpy.ndarray) 


To convert a size-1 tensor to a Python scalar, we can invoke the item function or Python's built-in
functions.


a = np.array([3.5]) 
a, a.item(), float(a), int(a) 


(array([3.5]), 3.5, 3.5, 3)


Summary


* The main interface to store and manipulate data for deep learning is the tensor (n-dimensional
array). It provides a variety of functionalities including basic mathematics operations,
broadcasting, indexing, slicing, memory saving, and conversion to other Python
objects.


Exercises 


1. Run the code in this section. Change the conditional statement X == Y in this section to X < 
Y or X > Y, and then see what kind of tensor you can get. 


2. Replace the two tensors that operate by element in the broadcasting mechanism with other 
shapes, e.g., 3-dimensional tensors. Is the result the same as expected? 


Discussions


2.2 Data Preprocessing 


So far we have introduced a variety of techniques for manipulating data that are already stored in 
tensors. To apply deep learning to solving real-world problems, we often begin with preprocess- 
ing raw data, rather than those nicely prepared data in the tensor format. Among popular data 
analytic tools in Python, the pandas package is commonly used. Like many other extension pack- 
ages in the vast ecosystem of Python, pandas can work together with tensors. So, we will briefly 
walk through steps for preprocessing raw data with pandas and converting them into the tensor 
format. We will cover more data preprocessing techniques in later chapters. 


2.2.1 Reading the Dataset 


As an example, we begin by creating an artificial dataset that is stored in a csv (comma-separated 
values) file ../data/house_tiny.csv. Data stored in other formats may be processed in similar 
ways. 


Below we write the dataset row by row into a csv file. 


import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # Column names
    f.write('NA,Pave,127500\n')  # Each row represents a data example
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')


To load the raw dataset from the created csv file, we import the pandas package and invoke the 
read_csv function. This dataset has four rows and three columns, where each row describes the 
number of rooms (“NumRooms”), the alley type (“Alley”), and the price (“Price”) of a house. 





https://discuss.d2l.ai/t/26


# If pandas is not installed, just uncomment the following line:
# !pip install pandas 
import pandas as pd 


data = pd.read_csv(data_file) 
print(data) 


   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000


2.2.2 Handling Missing Data 


Note that “NaN” entries are missing values. To handle missing data, typical methods include im- 
putation and deletion, where imputation replaces missing values with substituted ones, while dele- 
tion ignores missing values. Here we will consider imputation. 


By integer-location based indexing (iloc), we split data into inputs and outputs, where the former
takes the first two columns while the latter only keeps the last column. For numerical values in 
inputs that are missing, we replace the “NaN” entries with the mean value of the same column. 


inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2] 


inputs = inputs.fillna(inputs.mean(numeric_only=True))  # Average numeric columns only
print(inputs) 


   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN


For categorical or discrete values in inputs, we consider “NaN” as a category. Since the “Alley” 
column only takes two types of categorical values “Pave” and “NaN”, pandas can automatically 
convert this column to two columns “Alley_Pave” and “Alley_nan”. A row whose alley type is “Pave” 
will set values of “Alley_Pave” and “Alley_nan” to 1 and 0. A row with a missing alley type will set 
their values to 0 and 1. 


inputs = pd.get_dummies(inputs, dummy_na=True) 
print(inputs) 


   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1


2.2.3 Conversion to the Tensor Format


Now that all the entries in inputs and outputs are numerical, they can be converted to the tensor 
format. Once data are in this format, they can be further manipulated with those tensor function- 
alities that we have introduced in Section 2.1. 


from mxnet import np 


X, y = np.array(inputs.values), np.array(outputs.values) 
X, y


(array([[3., 1., 0.],
        [2., 0., 1.],
        [4., 0., 1.],
        [3., 0., 1.]], dtype=float64),
 array([127500, 106000, 178100, 140000], dtype=int64))


Summary 
* Like many other extension packages in the vast ecosystem of Python, pandas can work together
  with tensors.


* Imputation and deletion can be used to handle missing data.


Exercises 


Create a raw dataset with more rows and columns. 
1. Delete the column with the most missing values. 
2. Convert the preprocessed dataset to the tensor format. 


Discussions


2.3 Linear Algebra 


Now that you can store and manipulate data, let us briefly review the subset of basic linear algebra
that you will need to understand and implement most of the models covered in this book. Below, we
introduce the basic mathematical objects, arithmetic, and operations in linear algebra, expressing
each of them through mathematical notation and the corresponding implementation in code.





https://discuss.d2l.ai/t/28


2.3.1 Scalars


If you never studied linear algebra or machine learning, then your past experience with math
probably consisted of thinking about one number at a time. And, if you ever balanced a checkbook
or even paid for dinner at a restaurant, then you already know how to do basic things like
adding and multiplying pairs of numbers. For example, the temperature in Palo Alto is 52 degrees
Fahrenheit. Formally, we call values consisting of just one numerical quantity scalars. If
you wanted to convert this value to Celsius (the metric system’s more sensible temperature scale),
you would evaluate the expression $c = \frac{5}{9}(f - 32)$, setting $f$ to 52. In this equation, each
of the terms (5, 9, and 32) is a scalar value. The placeholders $c$ and $f$ are called variables and
they represent unknown scalar values.
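
As a quick worked instance of the conversion formula, the following sketch uses plain Python scalars; the function name `fahrenheit_to_celsius` is ours, chosen only for illustration.

```python
def fahrenheit_to_celsius(f):
    """Convert a scalar temperature from Fahrenheit to Celsius."""
    return 5 / 9 * (f - 32)

c = fahrenheit_to_celsius(52)
print(round(c, 2))  # 11.11
```

Here `f` and `c` play the role of the variables in the formula, while 5, 9, and 32 are fixed scalar constants.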


In this book, we adopt the mathematical notation where scalar variables are denoted by ordinary
lower-cased letters (e.g., $x$, $y$, and $z$). We denote the space of all (continuous) real-valued
scalars by $\mathbb{R}$. For expedience, we will punt on rigorous definitions of what precisely space
is, but just remember for now that the expression $x \in \mathbb{R}$ is a formal way to say that $x$ is
a real-valued scalar. The symbol $\in$ can be pronounced “in” and simply denotes membership in a set.
Analogously, we could write $x, y \in \{0, 1\}$ to state that $x$ and $y$ are numbers whose value can
only be 0 or 1.


A scalar is represented by a tensor with just one element. In the next snippet, we instantiate two 
scalars and perform some familiar arithmetic operations with them, namely addition, multipli- 
cation, division, and exponentiation. 


from mxnet import np, npx 
npx.set_np() 


x = np.array(3.0) 
y = np.array(2.0) 


x + y, x * y, x / y, x ** y


(array(5.), array(6.), array(1.5), array(9.)) 


2.3.2 Vectors 


You can think of a vector as simply a list of scalar values. We call these values the elements (entries
or components) of the vector. When our vectors represent examples from our dataset, their values
hold some real-world significance. For example, if we were training a model to predict the risk that
a loan defaults, we might associate each applicant with a vector whose components correspond
to their income, length of employment, number of previous defaults, and other factors. If we
were studying the risk of heart attacks hospital patients potentially face, we might represent each
patient by a vector whose components capture their most recent vital signs, cholesterol levels,
minutes of exercise per day, etc. In math notation, we will usually denote vectors as bold-faced,
lower-cased letters (e.g., $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$).


We work with vectors via one-dimensional tensors. In general tensors can have arbitrary lengths, 
subject to the memory limits of your machine. 


x = np.arange(4) 
x 


We can refer to any element of a vector by using a subscript. For example, we can refer to the
$i^\mathrm{th}$ element of $\mathbf{x}$ by $x_i$. Note that the element $x_i$ is a scalar, so we do not
bold-face the font when referring to it. Extensive literature considers column vectors to be the
default orientation of vectors, and so does this book. In math, a vector $\mathbf{x}$ can be written as


$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \tag{2.3.1}$$

where $x_1, \ldots, x_n$ are elements of the vector. In code, we access any element by indexing into
the tensor.


x[3]


array(3.)


Length, Dimensionality, and Shape 


Let us revisit some concepts from Section 2.1. A vector is just an array of numbers. And just as
every array has a length, so does every vector. In math notation, if we want to say that a vector
$\mathbf{x}$ consists of $n$ real-valued scalars, we can express this as $\mathbf{x} \in \mathbb{R}^n$.
The length of a vector is commonly called the dimension of the vector.


As with an ordinary Python array, we can access the length of a tensor by calling Python's built-in 
len() function. 


len(x) 


When a tensor represents a vector (with precisely one axis), we can also access its length via the
.shape attribute. The shape is a tuple that lists the length (dimensionality) along each axis of the
tensor. For tensors with just one axis, the shape has just one element.


x.shape


(4,) 


Note that the word “dimension” tends to get overloaded in these contexts and this tends to confuse 
people. To clarify, we use the dimensionality of a vector or an axis to refer to its length, i.e., the 
number of elements of a vector or an axis. However, we use the dimensionality of a tensor to refer 
to the number of axes that a tensor has. In this sense, the dimensionality of some axis of a tensor 
will be the length of that axis. 
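
The two senses of “dimension” can be told apart in code. The sketch below assumes plain NumPy, whose `len`, `.shape`, and `.ndim` behave analogously to the tensor attributes used in this section.

```python
import numpy as np

x = np.arange(4)                     # a vector: one axis of length 4
X = np.arange(24).reshape(2, 3, 4)   # a tensor with three axes

# "Dimensionality of a vector" = its length (number of elements).
print(len(x), x.shape)   # 4 (4,)

# "Dimensionality of a tensor" = its number of axes.
print(x.ndim, X.ndim)    # 1 3

# The dimensionality of an axis is the length of that axis.
print(X.shape[1])        # 3
```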


2.3.3 Matrices


Just as vectors generalize scalars from order zero to order one, matrices generalize vectors from 
order one to order two. Matrices, which we will typically denote with bold-faced, capital letters 
(e.g., X, Y, and Z), are represented in code as tensors with two axes. 


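In math notation, we use $\mathbf{A} \in \mathbb{R}^{m \times n}$ to express that the matrix $\mathbf{A}$
consists of $m$ rows and $n$ columns of real-valued scalars. Visually, we can illustrate any matrix
$\mathbf{A} \in \mathbb{R}^{m \times n}$ as a table, where each element $a_{ij}$ belongs to the
$i^\mathrm{th}$ row and $j^\mathrm{th}$ column:


$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}. \tag{2.3.2}$$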


For any $\mathbf{A} \in \mathbb{R}^{m \times n}$, the shape of $\mathbf{A}$ is $(m, n)$ or $m \times n$.
Specifically, when a matrix has the same number of rows and columns, its shape becomes a square;
thus, it is called a square matrix.


We can create an $m \times n$ matrix by specifying a shape with two components $m$ and $n$ when calling
any of our favorite functions for instantiating a tensor.


A = np.arange(20).reshape(5, 4) 

A 

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.],
       [12., 13., 14., 15.],
       [16., 17., 18., 19.]])


We can access the scalar element $a_{ij}$ of a matrix $\mathbf{A}$ in (2.3.2) by specifying the indices
for the row ($i$) and column ($j$), such as $[\mathbf{A}]_{ij}$. When the scalar elements of a matrix
$\mathbf{A}$, such as in (2.3.2), are not given, we may simply use the lower-case letter of the matrix
$\mathbf{A}$ with the index subscript, $a_{ij}$, to refer to $[\mathbf{A}]_{ij}$. To keep notation
simple, commas are inserted to separate indices only when necessary, such as $a_{2, 3j}$ and
$[\mathbf{A}]_{2i-1, 3}$.


Sometimes, we want to flip the axes. When we exchange a matrix's rows and columns, the result is
called the transpose of the matrix. Formally, we signify a matrix $\mathbf{A}$'s transpose by
$\mathbf{A}^\top$ and if $\mathbf{B} = \mathbf{A}^\top$, then $b_{ij} = a_{ji}$ for any $i$ and $j$.
Thus, the transpose of $\mathbf{A}$ in (2.3.2) is an $n \times m$ matrix:


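$$\mathbf{A}^\top = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix}. \tag{2.3.3}$$


Now we access a matrix's transpose in code.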


A.T 


array([[ 0.,  4.,  8., 12., 16.],
       [ 1.,  5.,  9., 13., 17.],
       [ 2.,  6., 10., 14., 18.],
       [ 3.,  7., 11., 15., 19.]])


As a special type of the square matrix, a symmetric matrix $\mathbf{A}$ is equal to its transpose:
$\mathbf{A} = \mathbf{A}^\top$. Here we define a symmetric matrix B.


B = np.array([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
B


array([[1., 2., 3.],
       [2., 0., 4.],
       [3., 4., 5.]])


Now we compare B with its transpose. 


B == B.T 


array([[ True, True, True], 
[ True, True, True], 
[ True, True, True]]) 


Matrices are useful data structures: they allow us to organize data that have different modalities 
of variation. For example, rows in our matrix might correspond to different houses (data exam- 
ples), while columns might correspond to different attributes. This should sound familiar if you 
have ever used spreadsheet software or have read Section 2.2. Thus, although the default orienta- 
tion of a single vector is a column vector, in a matrix that represents a tabular dataset, it is more 
conventional to treat each data example as a row vector in the matrix. And, as we will see in later 
chapters, this convention will enable common deep learning practices. For example, along the 
outermost axis of a tensor, we can access or enumerate minibatches of data examples, or just data 
examples if no minibatch exists. 


2.3.4 Tensors 


Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures
with even more axes. Tensors (“tensors” in this subsection refer to algebraic objects) give us a
generic way of describing n-dimensional arrays with an arbitrary number of axes. Vectors, for
example, are first-order tensors, and matrices are second-order tensors. Tensors are denoted with
capital letters of a special font face (e.g., $\mathsf{X}$, $\mathsf{Y}$, and $\mathsf{Z}$) and their
indexing mechanism (e.g., $x_{ijk}$ and $[\mathsf{X}]_{1, 2i-1, 3}$) is similar to that of matrices.


Tensors will become more important when we start working with images, which arrive as n- 
dimensional arrays with 3 axes corresponding to the height, width, and a channel axis for stacking 
the color channels (red, green, and blue). For now, we will skip over higher order tensors and 
focus on the basics. 


X = np.arange(24).reshape(2, 3, 4) 
X 


array([[[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]],

       [[12., 13., 14., 15.],
        [16., 17., 18., 19.],
        [20., 21., 22., 23.]]])


2.3.5 Basic Properties of Tensor Arithmetic 


Scalars, vectors, matrices, and tensors (“tensors” in this subsection refer to algebraic objects) of an 
arbitrary number of axes have some nice properties that often come in handy. For example, you 
might have noticed from the definition of an elementwise operation that any elementwise unary 
operation does not change the shape of its operand. Similarly, given any two tensors with the 
same shape, the result of any binary elementwise operation will be a tensor of that same shape. 
For example, adding two matrices of the same shape performs elementwise addition over these 
two matrices. 


A = np.arange(20).reshape(5, 4)
B = A.copy()  # Assign a copy of `A` to `B` by allocating new memory
A, A + B


(array([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.],
        [16., 17., 18., 19.]]),
 array([[ 0.,  2.,  4.,  6.],
        [ 8., 10., 12., 14.],
        [16., 18., 20., 22.],
        [24., 26., 28., 30.],
        [32., 34., 36., 38.]]))

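Specifically, elementwise multiplication of two matrices is called their Hadamard product (math
notation $\odot$). Consider matrix $\mathbf{B} \in \mathbb{R}^{m \times n}$ whose element of row $i$ and
column $j$ is $b_{ij}$. The Hadamard product of matrices $\mathbf{A}$ (defined in (2.3.2)) and
$\mathbf{B}$ is


$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} & \cdots & a_{1n} b_{1n} \\ a_{21} b_{21} & a_{22} b_{22} & \cdots & a_{2n} b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} b_{m1} & a_{m2} b_{m2} & \cdots & a_{mn} b_{mn} \end{bmatrix}. \tag{2.3.4}$$


A * B


array([[  0.,   1.,   4.,   9.],
       [ 16.,  25.,  36.,  49.],
       [ 64.,  81., 100., 121.],
       [144., 169., 196., 225.],
       [256., 289., 324., 361.]])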


Multiplying or adding a tensor by a scalar also does not change the shape of the tensor, where each
element of the operand tensor will be added or multiplied by the scalar.


a = 2
X = np.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape


(array([[[ 2.,  3.,  4.,  5.],
         [ 6.,  7.,  8.,  9.],
         [10., 11., 12., 13.]],

        [[14., 15., 16., 17.],
         [18., 19., 20., 21.],
         [22., 23., 24., 25.]]]),
 (2, 3, 4))





2.3.6 Reduction 


One useful operation that we can perform with arbitrary tensors is to calculate the sum of their
elements. In mathematical notation, we express sums using the $\sum$ symbol. To express the sum
of the elements in a vector $\mathbf{x}$ of length $d$, we write $\sum_{i=1}^{d} x_i$. In code, we can
just call the function for calculating the sum.


x = np.arange(4)
x, x.sum()


(array([0., 1., 2., 3.]), array(6.))


We can express sums over the elements of tensors of arbitrary shape. For example, the sum of the
elements of an $m \times n$ matrix $\mathbf{A}$ could be written $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$.


A.shape, A.sum() 


((5, 4), array(190.)) 


By default, invoking the function for calculating the sum reduces a tensor along all its axes to a 
scalar. We can also specify the axes along which the tensor is reduced via summation. Take ma- 
trices as an example. To reduce the row dimension (axis 0) by summing up elements of all the 
rows, we specify axis=0 when invoking the function. Since the input matrix reduces along axis 0 
to generate the output vector, the dimension of axis 0 of the input is lost in the output shape. 


A_sum_axis0 = A.sum(axis=0)
A_sum_axis0, A_sum_axis0.shape


(array([40., 45., 50., 55.]), (4,))


Specifying axis=1 will reduce the column dimension (axis 1) by summing up elements of all the 
columns. Thus, the dimension of axis 1 of the input is lost in the output shape. 


A_sum_axis1 = A.sum(axis=1) 
A_sum_axis1, A_sum_axis1.shape 


(array([ 6., 22., 38., 54., 70.]), (5,))


Reducing a matrix along both rows and columns via summation is equivalent to summing up all 
the elements of the matrix. 


A.sum(axis=[0, 1])  # Same as `A.sum()`


array(190.) 
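
To see how the choice of axis determines the output shape, the following sketch repeats the reductions side by side. It assumes plain NumPy instead of `mxnet.np`, and passes the axes as a tuple, which is the form we know NumPy accepts.

```python
import numpy as np

A = np.arange(20, dtype=float).reshape(5, 4)

col_sums = A.sum(axis=0)       # collapse axis 0 (rows)    -> shape (4,)
row_sums = A.sum(axis=1)       # collapse axis 1 (columns) -> shape (5,)
total = A.sum(axis=(0, 1))     # collapse both axes        -> scalar

print(col_sums.shape, row_sums.shape, total)  # (4,) (5,) 190.0
# Either partial reduction, summed again, recovers the full sum.
print(col_sums.sum() == total, row_sums.sum() == total)  # True True
```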


A related quantity is the mean, which is also called the average. We calculate the mean by dividing 
the sum by the total number of elements. In code, we could just call the function for calculating 
the mean on tensors of arbitrary shape. 


A.mean(), A.sum() / A.size 


(array(9.5), array(9.5)) 


Likewise, the function for calculating the mean can also reduce a tensor along the specified axes. 


A.mean(axis=0), A.sum(axis=0) / A.shape[0] 


(array([ 8.,  9., 10., 11.]), array([ 8.,  9., 10., 11.]))


Non-Reduction Sum 


However, sometimes it can be useful to keep the number of axes unchanged when invoking the 
function for calculating the sum or mean. 


sum_A = A.sum(axis=1, keepdims=True) 
sum_A 


array([[ 6.],
       [22.],
       [38.],
       [54.],
       [70.]])


For instance, since sum_A still keeps its two axes after summing each row, we can divide A by sum_A 
with broadcasting. 


A / sum_A 


array([[0.        , 0.16666667, 0.33333334, 0.5       ],
       [0.18181819, 0.22727273, 0.27272728, 0.3181818 ],
       [0.21052632, 0.23684211, 0.2631579 , 0.28947368],
       [0.22222222, 0.24074075, 0.25925925, 0.2777778 ],
       [0.22857143, 0.24285714, 0.25714287, 0.27142859]])
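
A quick way to convince yourself that this division normalizes each row is to check that every row of the result sums to one. The sketch below assumes plain NumPy; the names `row_sums` and `normalized` are ours.

```python
import numpy as np

A = np.arange(20, dtype=float).reshape(5, 4)
row_sums = A.sum(axis=1, keepdims=True)   # shape (5, 1), not (5,)

# Broadcasting stretches the (5, 1) column across the 4 columns of A.
normalized = A / row_sums

print(row_sums.shape)                              # (5, 1)
print(np.allclose(normalized.sum(axis=1), 1.0))    # True
```

Keeping the reduced axis with length 1 is what lets the shapes (5, 4) and (5, 1) line up for broadcasting.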


If we want to calculate the cumulative sum of elements of A along some axis, say axis=0 (row by
row), we can call the cumsum function. This function will not reduce the input tensor along any
axis.


A.cumsum(axis=0)


array([[ 0.,  1.,  2.,  3.],
       [ 4.,  6.,  8., 10.],
       [12., 15., 18., 21.],
       [24., 28., 32., 36.],
       [40., 45., 50., 55.]])


2.3.7 Dot Products 


So far, we have only performed elementwise operations, sums, and averages. And if this was all
we could do, linear algebra probably would not deserve its own section. However, one of the most
fundamental operations is the dot product. Given two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$,
their dot product $\mathbf{x}^\top \mathbf{y}$ (or $\langle \mathbf{x}, \mathbf{y} \rangle$) is a sum
over the products of the elements at the same position:
$\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$.


y = np.ones(4)
x, y, np.dot(x, y)


(array([0., 1., 2., 3.]), array([1., 1., 1., 1.]), array(6.))


Note that we can express the dot product of two vectors equivalently by performing an element- 
wise multiplication and then a sum: 


np.sum(x * y) 


array(6.) 


Dot products are useful in a wide range of contexts. For example, given some set of values, denoted
by a vector $\mathbf{x} \in \mathbb{R}^d$ and a set of weights denoted by $\mathbf{w} \in \mathbb{R}^d$,
the weighted sum of the values in $\mathbf{x}$ according to the weights $\mathbf{w}$ could be expressed
as the dot product $\mathbf{x}^\top \mathbf{w}$. When the weights are non-negative and sum to one
(i.e., $\sum_{i=1}^{d} w_i = 1$), the dot product expresses a weighted average.


After normalizing two vectors to have unit length, their dot product expresses the cosine of the
angle between them. We will formally introduce this notion of length later in this section.
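
Both uses of the dot product can be sketched in a few lines. The example assumes plain NumPy; the vectors `values` and `weights` and the helper `cosine` are hypothetical, and the helper borrows the Euclidean length (`np.linalg.norm`) that is introduced formally later in this section.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])     # hypothetical values x
weights = np.array([0.1, 0.2, 0.3, 0.4])    # non-negative, sum to one

# Weighted average as a dot product.
weighted_avg = np.dot(values, weights)
print(np.isclose(weighted_avg, np.average(values, weights=weights)))  # True

def cosine(u, v):
    """Cosine of the angle between u and v: dot product of unit vectors."""
    u_hat = u / np.linalg.norm(u)
    v_hat = v / np.linalg.norm(v)
    return np.dot(u_hat, v_hat)

print(cosine(np.array([1.0, 0.0]), np.array([0.0, 2.0])))  # 0.0 (orthogonal)
print(np.isclose(cosine(np.array([1.0, 1.0]), np.array([3.0, 3.0])), 1.0))  # True (parallel)
```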


2.3.8 Matrix-Vector Products


Now that we know how to calculate dot products, we can begin to understand matrix-vector products.
Recall the matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ and the vector $\mathbf{x} \in \mathbb{R}^n$
defined and visualized in (2.3.2) and (2.3.1) respectively. Let us start off by visualizing the matrix
$\mathbf{A}$ in terms of its row vectors


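$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_m^\top \end{bmatrix}, \tag{2.3.5}$$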










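where each $\mathbf{a}_i^\top \in \mathbb{R}^n$ is a row vector representing the $i^\mathrm{th}$ row of
the matrix $\mathbf{A}$. The matrix-vector product $\mathbf{A}\mathbf{x}$ is simply a column vector of
length $m$, whose $i^\mathrm{th}$ element is the dot product $\mathbf{a}_i^\top \mathbf{x}$:


$$\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_m^\top \end{bmatrix} \mathbf{x} = \begin{bmatrix} \mathbf{a}_1^\top \mathbf{x} \\ \mathbf{a}_2^\top \mathbf{x} \\ \vdots \\ \mathbf{a}_m^\top \mathbf{x} \end{bmatrix}. \tag{2.3.6}$$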


We can think of multiplication by a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ as a transformation
that projects vectors from $\mathbb{R}^n$ to $\mathbb{R}^m$. These transformations turn out to be
remarkably useful. For example, we can represent rotations as multiplications by a square matrix. As
we will see in subsequent chapters, we can also use matrix-vector products to describe the most
intensive calculations required when computing each layer in a neural network given the values of
the previous layer.








Expressing matrix-vector products in code with tensors, we use the same dot function as for dot 
products. When we call np.dot(A, x) with a matrix A and a vector x, the matrix-vector product is 
performed. Note that the column dimension of A (its length along axis 1) must be the same as the 
dimension of x (its length). 


A.shape, x.shape, np.dot(A, x) 


((5, 4), (4,), array([ 14.,  38.,  62.,  86., 110.]))
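
The earlier remark that rotations are multiplications by a square matrix can be made concrete with a small sketch. It assumes plain NumPy and the standard 2-D counterclockwise rotation matrix; the names `theta`, `R`, and `v` are ours.

```python
import numpy as np

theta = np.pi / 2  # rotate by 90 degrees counterclockwise
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

v = np.array([1.0, 0.0])
rotated = np.dot(R, v)  # matrix-vector product

print(np.allclose(rotated, [0.0, 1.0]))   # True
# A rotation changes direction but preserves length.
print(np.isclose(np.linalg.norm(rotated), np.linalg.norm(v)))  # True
```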


2.3.9 Matrix-Matrix Multiplication 


If you have gotten the hang of dot products and matrix-vector products, then matrix-matrix multi- 
plication should be straightforward. 


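Say that we have two matrices $\mathbf{A} \in \mathbb{R}^{n \times k}$ and
$\mathbf{B} \in \mathbb{R}^{k \times m}$:


$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{km} \end{bmatrix}. \tag{2.3.7}$$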


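Denote by $\mathbf{a}_i^\top \in \mathbb{R}^k$ the row vector representing the $i^\mathrm{th}$ row of
the matrix $\mathbf{A}$, and let $\mathbf{b}_j \in \mathbb{R}^k$ be the column vector from the
$j^\mathrm{th}$ column of the matrix $\mathbf{B}$. To produce the matrix product
$\mathbf{C} = \mathbf{A}\mathbf{B}$, it is easiest to think of $\mathbf{A}$ in terms of its row vectors
and $\mathbf{B}$ in terms of its column vectors:


$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_n^\top \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_m \end{bmatrix}. \tag{2.3.8}$$
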
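
Then the matrix product $\mathbf{C} \in \mathbb{R}^{n \times m}$ is produced as we simply compute each
element $c_{ij}$ as the dot product $\mathbf{a}_i^\top \mathbf{b}_j$:


$$\mathbf{C} = \mathbf{A}\mathbf{B} = \begin{bmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_n^\top \end{bmatrix} \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_m \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1^\top \mathbf{b}_1 & \mathbf{a}_1^\top \mathbf{b}_2 & \cdots & \mathbf{a}_1^\top \mathbf{b}_m \\ \mathbf{a}_2^\top \mathbf{b}_1 & \mathbf{a}_2^\top \mathbf{b}_2 & \cdots & \mathbf{a}_2^\top \mathbf{b}_m \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{a}_n^\top \mathbf{b}_1 & \mathbf{a}_n^\top \mathbf{b}_2 & \cdots & \mathbf{a}_n^\top \mathbf{b}_m \end{bmatrix}. \tag{2.3.9}$$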


We can think of the matrix-matrix multiplication $\mathbf{AB}$ as simply performing $m$ matrix-vector
products and stitching the results together to form an $n \times m$ matrix. In the following snippet,
we perform matrix multiplication on A and B. Here, A is a matrix with 5 rows and 4 columns, and B is
a matrix with 4 rows and 3 columns. After multiplication, we obtain a matrix with 5 rows and 3
columns.


B = np.ones(shape=(4, 3)) 


np.dot(A, B)


array([[ 6.,  6.,  6.],
       [22., 22., 22.],
       [38., 38., 38.],
       [54., 54., 54.],
       [70., 70., 70.]])


Matrix-matrix multiplication can be simply called matrix multiplication, and should not be con- 
fused with the Hadamard product. 
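
The view of matrix multiplication as stitched-together matrix-vector products can be checked directly. The sketch below assumes plain NumPy, and uses a hypothetical non-constant `B` so that the columns of the result differ.

```python
import numpy as np

A = np.arange(20, dtype=float).reshape(5, 4)
B = np.arange(12, dtype=float).reshape(4, 3)

# One matrix-vector product per column of B, stitched together column by column.
columns = [np.dot(A, B[:, j]) for j in range(B.shape[1])]
stitched = np.stack(columns, axis=1)

print(stitched.shape)                          # (5, 3)
print(np.allclose(stitched, np.dot(A, B)))     # True

# The Hadamard product, by contrast, is elementwise and keeps the operand shape.
print((A * A).shape)                           # (5, 4)
```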


2.3.10 Norms 


Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector 
tells us how big a vector is. The notion of size under consideration here concerns not dimension- 
ality but rather the magnitude of the components. 


In linear algebra, a vector norm is a function $f$ that maps a vector to a scalar, satisfying a handful
of properties. Given any vector $\mathbf{x}$, the first property says that if we scale all the elements
of a vector by a constant factor $\alpha$, its norm also scales by the absolute value of the same
constant factor:


$$f(\alpha \mathbf{x}) = |\alpha| f(\mathbf{x}). \tag{2.3.10}$$


The second property is the familiar triangle inequality:


$$f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y}). \tag{2.3.11}$$


The third property simply says that the norm must be non-negative:


$$f(\mathbf{x}) \geq 0. \tag{2.3.12}$$


That makes sense, as in most contexts the smallest size for anything is 0. The final property requires
that the smallest norm is achieved and only achieved by a vector consisting of all zeros.


$$\forall i, [\mathbf{x}]_i = 0 \Leftrightarrow f(\mathbf{x}) = 0. \tag{2.3.13}$$
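
These properties can be spot-checked numerically for one candidate function. The sketch below assumes plain NumPy and takes `np.linalg.norm` (the Euclidean norm) as the candidate $f$; the random vectors and the factor `alpha` are arbitrary choices, and passing such checks illustrates the properties rather than proving them.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed, arbitrary choice
x, y = rng.normal(size=5), rng.normal(size=5)
alpha = -2.5
f = np.linalg.norm                   # Euclidean norm by default

print(np.isclose(f(alpha * x), abs(alpha) * f(x)))   # scaling
print(f(x + y) <= f(x) + f(y))                       # triangle inequality
print(f(x) >= 0)                                     # non-negativity
print(f(np.zeros(5)) == 0)                           # zero vector has zero norm
```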


You might notice that norms sound a lot like measures of distance. And if you remember Euclidean
distances (think Pythagoras’ theorem) from grade school, then the concepts of non-negativity and
the triangle inequality might ring a bell. In fact, the Euclidean distance is a norm: specifically it
is the $L_2$ norm. Suppose that the elements in the $n$-dimensional vector $\mathbf{x}$ are
$x_1, \ldots, x_n$. The $L_2$ norm of $\mathbf{x}$ is the square root of the sum of the squares of the
vector elements:


$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}, \tag{2.3.14}$$


where the subscript 2 is often omitted in $L_2$ norms, i.e., $\|\mathbf{x}\|$ is equivalent to
$\|\mathbf{x}\|_2$. In code, we can calculate the $L_2$ norm of a vector as follows.


u = np.array([3, -4])
np.linalg.norm(u)


array(5.) 


In deep learning, we work more often with the squared $L_2$ norm. You will also frequently encounter
the $L_1$ norm, which is expressed as the sum of the absolute values of the vector elements:


$$\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|. \tag{2.3.15}$$


As compared with the $L_2$ norm, it is less influenced by outliers. To calculate the $L_1$ norm, we
compose the absolute value function with a sum over the elements.


np.abs(u).sum() 


array(7.) 


Both the $L_2$ norm and the $L_1$ norm are special cases of the more general $L_p$ norm:


$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}. \tag{2.3.16}$$
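
Formula (2.3.16) translates almost literally into code. The sketch below assumes plain NumPy; `lp_norm` is a hypothetical helper written for illustration, compared against NumPy's `np.linalg.norm` with its `ord` argument.

```python
import numpy as np

def lp_norm(x, p):
    """L_p norm of a vector for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

u = np.array([3.0, -4.0])
print(lp_norm(u, 1))   # 7.0
print(lp_norm(u, 2))   # 5.0
print(np.isclose(lp_norm(u, 3), np.linalg.norm(u, ord=3)))  # True
```

Setting `p=1` and `p=2` recovers the two norms computed earlier for the same vector.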


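Analogous to $L_2$ norms of vectors, the Frobenius norm of a matrix
$\mathbf{X} \in \mathbb{R}^{m \times n}$ is the square root of the sum of the squares of the matrix
elements:


$$\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2}. \tag{2.3.17}$$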





The Frobenius norm satisfies all the properties of vector norms. It behaves as if it were an $L_2$ norm
of a matrix-shaped vector. Invoking the following function will calculate the Frobenius norm of a
matrix.


np.linalg.norm(np.ones((4, 9)))


array(6.) 
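
The claim that the Frobenius norm behaves like an $L_2$ norm of a matrix-shaped vector can be verified by flattening the matrix first. The sketch assumes plain NumPy, where `np.linalg.norm` applied to a two-axis array computes the Frobenius norm by default.

```python
import numpy as np

X = np.ones((4, 9))
fro = np.linalg.norm(X)                   # Frobenius norm of the matrix
flat_l2 = np.linalg.norm(X.reshape(-1))   # L2 norm of the 36-element vector
manual = np.sqrt((X ** 2).sum())          # formula (2.3.17) written out

print(fro, flat_l2, manual)   # 6.0 6.0 6.0
```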


Norms and Objectives 


While we do not want to get too far ahead of ourselves, we can plant some intuition already about 
why these concepts are useful. In deep learning, we are often trying to solve optimization prob- 
lems: maximize the probability assigned to observed data; minimize the distance between pre- 
dictions and the ground-truth observations. Assign vector representations to items (like words, 
products, or news articles) such that the distance between similar items is minimized, and the 
distance between dissimilar items is maximized. Oftentimes, the objectives, perhaps the most 
important components of deep learning algorithms (besides the data), are expressed as norms. 


2.3.11 More on Linear Algebra


In just this section, we have taught you all the linear algebra that you will need to understand a 
remarkable chunk of modern deep learning. There is a lot more to linear algebra and a lot of 
that mathematics is useful for machine learning. For example, matrices can be decomposed into 
factors, and these decompositions can reveal low-dimensional structure in real-world datasets. 
There are entire subfields of machine learning that focus on using matrix decompositions and 
their generalizations to high-order tensors to discover structure in datasets and solve prediction 
problems. But this book focuses on deep learning. And we believe you will be much more inclined
to learn more mathematics once you have gotten your hands dirty deploying useful machine learn- 
ing models on real datasets. So while we reserve the right to introduce more mathematics much 
later on, we will wrap up this section here. 


If you are eager to learn more about linear algebra, you may refer to either the online appendix
on linear algebraic operations or other excellent resources (Strang, 1993; Kolter, 2008; Petersen
et al., 2008).


Summary 


* Scalars, vectors, matrices, and tensors are basic mathematical objects in linear algebra.


* Vectors generalize scalars, and matrices generalize vectors.


* Scalars, vectors, matrices, and tensors have zero, one, two, and an arbitrary number of axes,
  respectively.


* A tensor can be reduced along the specified axes by sum and mean.


* Elementwise multiplication of two matrices is called their Hadamard product. It is different
  from matrix multiplication.


* In deep learning, we often work with norms such as the $L_1$ norm, the $L_2$ norm, and the
  Frobenius norm.


* We can perform a variety of operations over scalars, vectors, matrices, and tensors.


Exercises 


1. Prove that the transpose of a matrix $\mathbf{A}$'s transpose is $\mathbf{A}$:
   $(\mathbf{A}^\top)^\top = \mathbf{A}$.


2. Given two matrices $\mathbf{A}$ and $\mathbf{B}$, show that the sum of transposes is equal to the
   transpose of a sum: $\mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top$.


3. Given any square matrix $\mathbf{A}$, is $\mathbf{A} + \mathbf{A}^\top$ always symmetric? Why?


4. We defined the tensor X of shape (2, 3, 4) in this section. What is the output of len(X)?


5. For a tensor X of arbitrary shape, does len(X) always correspond to the length of a certain
   axis of X? What is that axis?


6. Run A / A.sum(axis=1) and see what happens. Can you analyze the reason?


7. When traveling between two points in Manhattan, what is the distance that you need to cover 
in terms of the coordinates, i.e., in terms of avenues and streets? Can you travel diagonally? 





https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/geometry-linear-algebraic-ops.html


8. Consider a tensor with shape (2, 3, 4). What are the shapes of the summation outputs along
   axis 0, 1, and 2?


9. Feed a tensor with 3 or more axes to the linalg.norm function and observe its output. What 
does this function compute for tensors of arbitrary shape? 


Discussions


2.4 Calculus 


Finding the area of a polygon had remained mysterious until at least 2,500 years ago, when ancient 
Greeks divided a polygon into triangles and summed their areas. To find the area of curved shapes, 
such as a circle, ancient Greeks inscribed polygons in such shapes. As shown in Fig. 2.4.1, an 
inscribed polygon with more sides of equal length better approximates the circle. This process is 
also known as the method of exhaustion. 




Fig. 2.4.1: Find the area of a circle with the method of exhaustion. 


In fact, the method of exhaustion is where integral calculus (described in Section 18.5) originates
from. More than 2,000 years later, the other branch of calculus, differential calculus, was
invented. Among the most critical applications of differential calculus, optimization problems
consider how to do something in the best possible way. As discussed in Section 2.3.10, such problems
are ubiquitous in deep learning.


In deep learning, we train models, updating them successively so that they get better and better 
as they see more and more data. Usually, getting better means minimizing a loss function, a score 
that answers the question “how bad is our model?” This question is more subtle than it appears. 
Ultimately, what we really care about is producing a model that performs well on data that we have 
never seen before. But we can only fit the model to data that we can actually see. Thus we can 
decompose the task of fitting models into two key concerns: i) optimization: the process of fitting 
our models to observed data; ii) generalization: the mathematical principles and practitioners’ 
wisdom that guide as to how to produce models whose validity extends beyond the exact set of 
data examples used to train them. 


To help you understand optimization problems and methods in later chapters, here we give a very 
brief primer on differential calculus that is commonly used in deep learning. 





https://discuss.d2l.ai/t/30


2.4.1 Derivatives and Differentiation


We begin by addressing the calculation of derivatives, a crucial step in nearly all deep learning 
optimization algorithms. In deep learning, we typically choose loss functions that are differen- 
tiable with respect to our model's parameters. Put simply, this means that for each parameter, 
we can determine how rapidly the loss would increase or decrease, were we to increase or decrease 
that parameter by an infinitesimally small amount. 


Suppose that we have a function $f: \mathbb{R} \rightarrow \mathbb{R}$, whose input and output are both
scalars. The derivative of $f$ is defined as


$$f'(x) = \lim_{h \rightarrow 0} \frac{f(x + h) - f(x)}{h}, \tag{2.4.1}$$





if this limit exists. If $f'(a)$ exists, $f$ is said to be differentiable at $a$. If $f$ is
differentiable at every number of an interval, then this function is differentiable on this interval.
We can interpret the derivative $f'(x)$ in (2.4.1) as the instantaneous rate of change of $f(x)$ with
respect to $x$. The so-called instantaneous rate of change is based on the variation $h$ in $x$, which
approaches 0.


To illustrate derivatives, let us experiment with an example. Define $u = f(x) = 3x^2 - 4x$.


%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mxnet import np, npx
npx.set_np()


def f(x):
    return 3 * x ** 2 - 4 * x


By setting $x = 1$ and letting $h$ approach 0, the numerical result of $\frac{f(x + h) - f(x)}{h}$ in
(2.4.1) approaches 2. Though this experiment is not a mathematical proof, we will see later that the
derivative $u'$ is 2 when $x = 1$.


def numerical_lim(f, x, h):
    return (f(x + h) - f(x)) / h


h = 0.1
for i in range(5):
    print(f'h={h:.5f}, numerical limit={numerical_lim(f, 1, h):.5f}')
    h *= 0.1


h=0.10000, numerical limit=2.30000 
h=0.01000, numerical limit=2.03000 
h=0.00100, numerical limit=2.00300 
h=0.00010, numerical limit=2.00030 
h=0.00001, numerical limit=2.00003 


Let us familiarize ourselves with a few equivalent notations for derivatives. Given $y = f(x)$, where
$x$ and $y$ are the independent variable and the dependent variable of the function $f$, respectively,
the following expressions are equivalent:


$$f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x), \tag{2.4.2}$$


where symbols $\frac{d}{dx}$ and $D$ are differentiation operators that indicate operation of
differentiation. We can use the following rules to differentiate common functions:


* $DC = 0$ ($C$ is a constant),
* $Dx^n = nx^{n-1}$ (the power rule, $n$ is any real number),
* $De^x = e^x$,
* $D\ln(x) = 1/x$.


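To differentiate a function that is formed from a few simpler functions such as the above common
functions, the following rules can be handy for us. Suppose that functions $f$ and $g$ are both
differentiable and $C$ is a constant. Then we have the constant multiple rule


$$\frac{d}{dx} [C f(x)] = C \frac{d}{dx} f(x), \tag{2.4.3}$$


the sum rule


$$\frac{d}{dx} [f(x) + g(x)] = \frac{d}{dx} f(x) + \frac{d}{dx} g(x), \tag{2.4.4}$$


the product rule


$$\frac{d}{dx} [f(x) g(x)] = f(x) \frac{d}{dx} [g(x)] + g(x) \frac{d}{dx} [f(x)], \tag{2.4.5}$$


and the quotient rule


$$\frac{d}{dx} \left[ \frac{f(x)}{g(x)} \right] = \frac{g(x) \frac{d}{dx} [f(x)] - f(x) \frac{d}{dx} [g(x)]}{[g(x)]^2}. \tag{2.4.6}$$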
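
As a sanity check, the product and quotient rules can be compared against a forward-difference estimate of the kind used earlier in this section. This is a sketch in plain Python; the choice of $f(x) = x^2$, $g(x) = e^x$, the point `x = 1.5`, and the step `h` are arbitrary assumptions for illustration.

```python
import math

def numerical_derivative(func, x, h=1e-6):
    """Forward-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x)) / h

f, df = lambda x: x ** 2, lambda x: 2 * x   # power rule: D x^2 = 2x
g, dg = math.exp, math.exp                  # D e^x = e^x

x = 1.5
product = numerical_derivative(lambda t: f(t) * g(t), x)
quotient = numerical_derivative(lambda t: f(t) / g(t), x)

# Product rule: f g' + g f'
print(math.isclose(product, f(x) * dg(x) + g(x) * df(x), rel_tol=1e-4))   # True
# Quotient rule: (g f' - f g') / g^2
print(math.isclose(quotient, (g(x) * df(x) - f(x) * dg(x)) / g(x) ** 2,
                   rel_tol=1e-4))                                         # True
```

Agreement up to a small tolerance is expected here because the forward difference is only an approximation of the limit in (2.4.1).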


Now we can apply a few of the above rules to find
$u' = f'(x) = 3 \frac{d}{dx} x^2 - 4 \frac{d}{dx} x = 6x - 4$. Thus, by setting $x = 1$, we have
$u' = 2$: this is supported by our earlier experiment in this section where the numerical result
approaches 2. This derivative is also the slope of the tangent line to the curve $u = f(x)$ when
$x = 1$.


To visualize such an interpretation of derivatives, we will use matplotlib, a popular plotting li- 
brary in Python. To configure properties of the figures produced by matplotlib, we need to define 
a few functions. In the following, the use_svg_display function specifies the matplotlib package 
to output the svg figures for sharper images. Note that the comment #@save is a special mark 
where the following function, class, or statements are saved in the d2l package so later they can
be directly invoked (e.g., d2l.use_svg_display()) without being redefined.


def use_svg_display():  #@save
    """Use the svg format to display a plot in Jupyter."""
    display.set_matplotlib_formats('svg')


We define the set_figsize function to specify the figure sizes. Note that here we directly use d2l.plt
since the import statement from matplotlib import pyplot as plt has been marked for
being saved in the d2l package in the preface.


def set_figsize(figsize=(3.5, 2.5)):  #@save
    """Set the figure size for matplotlib."""
    use_svg_display()
    d2l.plt.rcParams['figure.figsize'] = figsize


The following set_axes function sets properties of axes of figures produced by matplotlib. 







#@save
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib."""
    axes.set_xlabel(xlabel)
    axes.set_ylabel(ylabel)
    axes.set_xscale(xscale)
    axes.set_yscale(yscale)
    axes.set_xlim(xlim)
    axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()


With these three functions for figure configurations, we define the plot function to plot multiple 
curves succinctly since we will need to visualize many curves throughout the book. 


#@save
def plot(X, Y=None, xlabel=None, ylabel=None, legend=None, xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """Plot data points."""
    if legend is None:
        legend = []

    set_figsize(figsize)
    axes = axes if axes else d2l.plt.gca()

    # Return True if `X` (tensor or list) has 1 axis
    def has_one_axis(X):
        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                and not hasattr(X[0], "__len__"))

    if has_one_axis(X):
        X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        if len(x):
            axes.plot(x, y, fmt)
        else:
            axes.plot(y, fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)


Now we can plot the function u = f(x) and its tangent line y = 2x - 3 at x = 1, where the coefficient
2 is the slope of the tangent line.


x = np.arange(0, 3, 0.1)
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])


[Figure: plot of f(x) and its tangent line at x=1, with legend entries "f(x)" and "Tangent line (x=1)"]





2.4.2 Partial Derivatives 


So far we have dealt with the differentiation of functions of just one variable. In deep learning, 
functions often depend on many variables. Thus, we need to extend the ideas of differentiation to 
these multivariate functions. 


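Let $y = f(x_1, x_2, \ldots, x_n)$ be a function with n variables. The partial derivative of y with respect to
its $i^\mathrm{th}$ parameter $x_i$ is


$$\frac{\partial y}{\partial x_i} = \lim_{h \rightarrow 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}. \tag{2.4.7}$$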


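To calculate $\frac{\partial y}{\partial x_i}$, we can simply treat $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ as constants and calculate the
derivative of y with respect to $x_i$. For notation of partial derivatives, the following are equivalent:


$$\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = f_{x_i} = f_i = D_i f = D_{x_i} f. \tag{2.4.8}$$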
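
The limit in (2.4.7) can be approximated numerically by perturbing one coordinate while holding the others fixed. The following is a minimal sketch in plain Python; the function `g` and the helper `partial_derivative` are hypothetical examples chosen for illustration, not part of the book's code.

```python
def partial_derivative(f, x, i, h=1e-6):
    # Perturb only the i-th coordinate and hold the others fixed
    x_plus = list(x)
    x_plus[i] += h
    return (f(x_plus) - f(x)) / h


def g(x):
    # Hypothetical example: g(x1, x2) = x1**2 * x2 + 3 * x2
    return x[0] ** 2 * x[1] + 3 * x[1]


point = [2.0, 5.0]
dg_dx1 = partial_derivative(g, point, 0)  # analytic value: 2 * x1 * x2
dg_dx2 = partial_derivative(g, point, 1)  # analytic value: x1**2 + 3
print(dg_dx1, dg_dx2)
```

Treating the other coordinates as constants is exactly what copying `x` and changing a single entry does here.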





2.4.3 Gradients 


We can concatenate partial derivatives of a multivariate function with respect to all its variables 
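to obtain the gradient vector of the function. Suppose that the input of function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is an
n-dimensional vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top$ and the output is a scalar. The gradient of the function
f(x) with respect to x is a vector of n partial derivatives:


$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n}\bigg]^\top, \tag{2.4.9}$$


where $\nabla_{\mathbf{x}} f(\mathbf{x})$ is often replaced by $\nabla f(\mathbf{x})$ when there is no ambiguity.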


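Let x be an n-dimensional vector. The following rules are often used when differentiating
multivariate functions:

* For all $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\nabla_{\mathbf{x}} \mathbf{A} \mathbf{x} = \mathbf{A}^\top$,
* For all $\mathbf{A} \in \mathbb{R}^{n \times m}$, $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} = \mathbf{A}$,
* For all $\mathbf{A} \in \mathbb{R}^{n \times n}$, $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$,
* $\nabla_{\mathbf{x}} \|\mathbf{x}\|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}$.


Similarly, for any matrix X, we have $\nabla_{\mathbf{X}} \|\mathbf{X}\|_F^2 = 2\mathbf{X}$. As we will see later, gradients are useful for
designing optimization algorithms in deep learning.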
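
The rule for the quadratic form, that the gradient of x^T A x is (A + A^T)x, can be checked numerically. The sketch below uses plain Python lists and central differences; the matrix `A`, the point `x`, and the helper names are arbitrary choices made for illustration.

```python
def quad_form(A, x):
    # Computes x^T A x for a square matrix A given as a list of rows
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def numerical_gradient(f, x, h=1e-6):
    # Central differences, one coordinate at a time
    grad = []
    for i in range(len(x)):
        x_hi, x_lo = list(x), list(x)
        x_hi[i] += h
        x_lo[i] -= h
        grad.append((f(x_hi) - f(x_lo)) / (2 * h))
    return grad


A = [[1.0, 2.0], [3.0, 4.0]]  # arbitrary and deliberately non-symmetric
x = [0.5, -1.5]

# Closed form from the rule: (A + A^T) x
n = len(x)
analytic = [sum((A[i][j] + A[j][i]) * x[j] for j in range(n))
            for i in range(n)]
numeric = numerical_gradient(lambda v: quad_form(A, v), x)
print(analytic, numeric)
```

A non-symmetric `A` is used on purpose: for symmetric matrices the rule collapses to 2Ax, which would hide the role of the transpose.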


2.4.4 Chain Rule 


However, such gradients can be hard to find. This is because multivariate functions in deep learning
are often composite, so we may not apply any of the aforementioned rules to differentiate these
functions. Fortunately, the chain rule enables us to differentiate composite functions.


Let us first consider functions of a single variable. Suppose that functions y = f(u) and u = g(x)
are both differentiable. Then the chain rule states that


$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}. \tag{2.4.10}$$


Now let us turn our attention to a more general scenario where functions have an arbitrary
number of variables. Suppose that the differentiable function y has variables $u_1, u_2, \ldots, u_m$,
where each differentiable function $u_i$ has variables $x_1, x_2, \ldots, x_n$. Note that y is a function of
$x_1, x_2, \ldots, x_n$. Then the chain rule gives


$$\frac{dy}{dx_i} = \frac{dy}{du_1} \frac{du_1}{dx_i} + \frac{dy}{du_2} \frac{du_2}{dx_i} + \cdots + \frac{dy}{du_m} \frac{du_m}{dx_i} \tag{2.4.11}$$


for any i = 1, 2, ..., n.
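
To make the multivariate chain rule concrete, the sketch below applies it to a small hypothetical composite function with two intermediate variables and compares the result against finite differences. The functions `u`, `y_of_u`, and `y_of_x` are made up for this illustration.

```python
def u(x1, x2):
    # Hypothetical intermediate variables: u1 = x1 * x2, u2 = x1 + x2**2
    return x1 * x2, x1 + x2 ** 2


def y_of_u(u1, u2):
    # Hypothetical outer function: y = u1**2 + 3 * u2
    return u1 ** 2 + 3 * u2


def y_of_x(x1, x2):
    return y_of_u(*u(x1, x2))


x1, x2 = 1.5, -2.0
u1, u2 = u(x1, x2)

# Derivatives of y with respect to u1 and u2
dy_du = (2 * u1, 3.0)
# Derivatives of (u1, u2) with respect to x1 and with respect to x2
du_dx1 = (x2, 1.0)
du_dx2 = (x1, 2 * x2)

# Chain rule: sum over the intermediate variables
dy_dx1 = sum(a * b for a, b in zip(dy_du, du_dx1))
dy_dx2 = sum(a * b for a, b in zip(dy_du, du_dx2))

# Finite-difference check on the composite function
h = 1e-6
num_dx1 = (y_of_x(x1 + h, x2) - y_of_x(x1 - h, x2)) / (2 * h)
num_dx2 = (y_of_x(x1, x2 + h) - y_of_x(x1, x2 - h)) / (2 * h)
print(dy_dx1, num_dx1)
print(dy_dx2, num_dx2)
```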
Summary 


* Differential calculus and integral calculus are two branches of calculus, where the former
  can be applied to the ubiquitous optimization problems in deep learning.
* A derivative can be interpreted as the instantaneous rate of change of a function with respect
  to its variable. It is also the slope of the tangent line to the curve of the function.
* A gradient is a vector whose components are the partial derivatives of a multivariate function
  with respect to all its variables.
* The chain rule enables us to differentiate composite functions.


Exercises 
1. Plot the function $y = f(x) = x^3 - \frac{1}{x}$ and its tangent line when x = 1.
2. Find the gradient of the function $f(\mathbf{x}) = 3x_1^2 + 5e^{x_2}$.
3. What is the gradient of the function $f(\mathbf{x}) = \|\mathbf{x}\|_2$?
4. Can you write out the chain rule for the case where u = f(x, y, z) and x = x(a, b), y = y(a, b),
   and z = z(a, b)?


Discussions: https://discuss.d2l.ai/t/32


2.5 Automatic Differentiation


As we have explained in Section 2.4, differentiation is a crucial step in nearly all deep learning 
optimization algorithms. While the calculations for taking these derivatives are straightforward, 
requiring only some basic calculus, for complex models, working out the updates by hand can be 
a pain (and often error-prone). 


Deep learning frameworks expedite this work by automatically calculating derivatives, i.e., auto- 
matic differentiation. In practice, based on our designed model the system builds a computational 
graph, tracking which data combined through which operations to produce the output. Automatic 
differentiation enables the system to subsequently backpropagate gradients. Here, backpropagate 
simply means to trace through the computational graph, filling in the partial derivatives with re- 
spect to each parameter. 
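
To give a feel for what tracing through the computational graph involves, here is a deliberately tiny reverse-mode sketch for scalars in plain Python. It is not how MXNet implements autograd; the class `Var` and its methods are hypothetical and support only addition and multiplication.

```python
class Var:
    """A scalar node in a computational graph (illustrative, not MXNet's API)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Visit nodes in reverse topological order so that each node's
        # gradient is complete before it is passed on to its parents
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


a, b = Var(2.0), Var(3.0)
c = a * b + a * a  # c = ab + a^2, so dc/da = b + 2a and dc/db = a
c.backward()
print(a.grad, b.grad)
```

Each operation records its inputs together with the local derivative; `backward` then multiplies and accumulates those local derivatives from the output back to the inputs, which is the chain rule applied along the graph.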


from mxnet import autograd, np, npx 
npx.set_np() 


2.5.1 A Simple Example 


As a toy example, say that we are interested in differentiating the function $y = 2\mathbf{x}^\top \mathbf{x}$ with respect
to the column vector x. To start, let us create the variable x and assign it an initial value.


x = np.arange(4.0)
x


array([0., 1., 2., 3.])


Before we even calculate the gradient of y with respect to x, we will need a place to store it. It is 
important that we do not allocate new memory every time we take a derivative with respect to a 
parameter because we will often update the same parameters thousands or millions of times and 
could quickly run out of memory. Note that a gradient of a scalar-valued function with respect to 
a vector x is itself vector-valued and has the same shape as x. 


# We allocate memory for a tensor's gradient by invoking `attach_grad`
x.attach_grad()
# After we calculate a gradient taken with respect to `x`, we will be able to
# access it via the `grad` attribute, whose values are initialized with 0s
x.grad


array([0., 0., 0., 0.]) 


Now let us calculate y. 


# Place our code inside an `autograd.record` scope to build the computational
# graph
with autograd.record():
    y = 2 * np.dot(x, x)
y


array(28.)


Since x is a vector of length 4, an inner product of x and x is performed, yielding the scalar output 
that we assign to y. Next, we can automatically calculate the gradient of y with respect to each 
component of x by calling the function for backpropagation and printing the gradient. 


y.backward()
x.grad


array([ 0.,  4.,  8., 12.])


The gradient of the function $y = 2\mathbf{x}^\top \mathbf{x}$ with respect to x should be 4x. Let us quickly verify that
our desired gradient was calculated correctly. 


x.grad == 4 * x


array([ True,  True,  True,  True])


Now let us calculate another function of x. 


with autograd.record():
    y = x.sum()
y.backward()
x.grad  # Overwritten by the newly calculated gradient


array([1., 1., 1., 1.])


2.5.2 Backward for Non-Scalar Variables 


Technically, when y is not a scalar, the most natural interpretation of the differentiation of a vector 
y with respect to a vector x is a matrix. For higher-order and higher-dimensional y and x, the 
differentiation result could be a high-order tensor. 


However, while these more exotic objects do show up in advanced machine learning (including in 
deep learning), more often when we are calling backward on a vector, we are trying to calculate 
the derivatives of the loss functions for each constituent of a batch of training examples. Here, our 
intent is not to calculate the differentiation matrix but rather the sum of the partial derivatives 
computed individually for each example in the batch. 
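
The relationship between the full differentiation matrix and the summed gradient can be sketched without any framework. Below, a hypothetical helper `jacobian` approximates the matrix of partial derivatives of the elementwise square, and summing over its rows gives the gradient of the summed output. This is an illustration in plain Python, not MXNet code.

```python
def elementwise_square(x):
    return [xi * xi for xi in x]


def jacobian(f, x, h=1e-6):
    # J[i][j] approximates d f(x)[i] / d x[j] by central differences
    m = len(f(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        x_hi, x_lo = list(x), list(x)
        x_hi[j] += h
        x_lo[j] -= h
        f_hi, f_lo = f(x_hi), f(x_lo)
        for i in range(m):
            J[i][j] = (f_hi[i] - f_lo[i]) / (2 * h)
    return J


x = [0.0, 1.0, 2.0, 3.0]
J = jacobian(elementwise_square, x)
# Gradient of sum(y) with respect to x: add up the rows of the Jacobian
grad_of_sum = [sum(J[i][j] for i in range(len(J))) for j in range(len(x))]
print([round(g, 4) for g in grad_of_sum])
```

For an elementwise function the matrix is diagonal, so the summed gradient simply collects the per-example derivatives, here 2x.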


# When we invoke `backward` on a vector-valued variable `y` (function of `x`),
# a new scalar variable is created by summing the elements in `y`. Then the
# gradient of that scalar variable with respect to `x` is computed
with autograd.record():
    y = x * x  # `y` is a vector
y.backward()
x.grad  # Equals the gradient of y = sum(x * x)


array([0., 2., 4., 6.])


2.5.3 Detaching Computation


Sometimes, we wish to move some calculations outside of the recorded computational graph. For 
example, say that y was calculated as a function of x, and that subsequently z was calculated as a 
function of both y and x. Now, imagine that we wanted to calculate the gradient of z with respect 
to x, but wanted for some reason to treat y as a constant, and only take into account the role that 
x played after y was calculated. 


Here, we can detach y to return a new variable u that has the same value as y but discards any 
information about how y was computed in the computational graph. In other words, the gradient 
will not flow backwards through u to x. Thus, the following backpropagation function computes 
the partial derivative of z = u * x with respect to x while treating u as a constant, instead of the 
partial derivative of z = x * x * x with respect to x.


with autograd.record():
    y = x * x
    u = y.detach()
    z = u * x
z.backward()
x.grad == u


array([ True, True, True, True]) 


Since the computation of y was recorded, we can subsequently invoke backpropagation on y to 
get the derivative of y = x * x with respect to x, which is 2 * x. 


y.backward()
x.grad == 2 * x 


array([ True, True, True, True]) 
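
The effect of detaching can also be seen with ordinary finite differences on a single scalar. In the sketch below, `z_full` lets y vary with x, while `z_detached` receives a frozen copy `u` of y; both function names are hypothetical and the code does not use MXNet.

```python
def z_full(x):
    # z = y * x with y = x * x still depending on x, i.e., z = x**3
    return (x * x) * x


def z_detached(x, u):
    # z = u * x where u is a frozen copy of y that does not vary with x
    return u * x


x0, h = 3.0, 1e-6
u = x0 * x0  # value of y at x0, held constant afterwards

grad_full = (z_full(x0 + h) - z_full(x0 - h)) / (2 * h)
grad_detached = (z_detached(x0 + h, u) - z_detached(x0 - h, u)) / (2 * h)
print(grad_full, grad_detached)
```

The first slope is close to 3 * x0**2, while the second is close to u = x0**2, matching the behavior described for the detached variable.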


2.5.4 Computing the Gradient of Python Control Flow 


One benefit of using automatic differentiation is that even if building the computational graph 
of a function required passing through a maze of Python control flow (e.g., conditionals, loops, 
and arbitrary function calls), we can still calculate the gradient of the resulting variable. In the 
following snippet, note that the number of iterations of the while loop and the evaluation of the 
if statement both depend on the value of the input a. 


def f(a):
    b = a * 2
    while np.linalg.norm(b) < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c


Let us compute the gradient.


a = np.random.normal()
a.attach_grad()
with autograd.record():
    d = f(a)
d.backward()


We can now analyze the f function defined above. Note that it is piecewise linear in its input a. In 
other words, for any a there exists some constant scalar k such that f(a) = k * a, where the value 
of k depends on the input a. Consequently d / a allows us to verify that the gradient is correct. 


a.grad == d / a


array(True) 
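
The piecewise-linear argument can be reproduced for plain Python floats. The sketch below is a scalar analogue of `f` that replaces `np.linalg.norm` and `sum` with `abs` and a direct comparison, which is an assumption made so that it runs without MXNet; the input values 0.37 and -1.2 are arbitrary.

```python
def f(a):
    # Scalar, plain-Python analogue of the control-flow function in the text
    b = a * 2
    while abs(b) < 1000:
        b = b * 2
    return b if b > 0 else 100 * b


def slope(a, rel_step=1e-6):
    # Small relative step so that both evaluations follow the same branches
    h = rel_step * abs(a)
    return (f(a + h) - f(a - h)) / (2 * h)


results = {a: (slope(a), f(a) / a) for a in (0.37, -1.2)}
for a, (numeric, ratio) in results.items():
    print(f'a={a}: finite difference={numeric:.3f}, f(a)/a={ratio:.3f}')
```

For each input, the finite-difference slope agrees with f(a) / a, although the constant k differs between the two inputs because they take different paths through the loop and the conditional.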


Summary 


* Deep learning frameworks can automate the calculation of derivatives. To use it, we first
attach gradients to those variables with respect to which we desire partial derivatives. We 
then record the computation of our target value, execute its function for backpropagation, 
and access the resulting gradient. 


Exercises 


1. Why is the second derivative much more expensive to compute than the first derivative? 


2. After running the function for backpropagation, immediately run it again and see what hap- 
pens. 


3. In the control flow example where we calculate the derivative of d with respect to a, what
would happen if we changed the variable a to a random vector or matrix? At this point, the
result of the calculation f(a) is no longer a scalar. What happens to the result? How do we
analyze this?


4. Redesign an example of finding the gradient of the control flow. Run and analyze the result. 


5. Let f(x) = sin(x). Plot f(x) and $\frac{df(x)}{dx}$, where the latter is computed without exploiting that
f'(x) = cos(x).


Discussions: https://discuss.d2l.ai/t/34


2.6 Probability


In some form or another, machine learning is all about making predictions. We might want to 
predict the probability of a patient suffering a heart attack in the next year, given their clinical his- 
tory. In anomaly detection, we might want to assess how likely a set of readings from an airplane's 
jet engine would be, were it operating normally. In reinforcement learning, we want an agent to 
act intelligently in an environment. This means we need to think about the probability of getting 
a high reward under each of the available actions. And when we build recommender systems we 
also need to think about probability. For example, say hypothetically that we worked for a large 
online bookseller. We might want to estimate the probability that a particular user would buy 
a particular book. For this we need to use the language of probability. Entire courses, majors, 
theses, careers, and even departments, are devoted to probability. So naturally, our goal in this 
section is not to teach the whole subject. Instead we hope to get you off the ground, to teach you 
just enough that you can start building your first deep learning models, and to give you enough of 
a flavor for the subject that you can begin to explore it on your own if you wish. 


We have already invoked probabilities in previous sections without articulating what precisely 
they are or giving a concrete example. Let us get more serious now by considering the first case: 
distinguishing cats and dogs based on photographs. This might sound simple but it is actually a 
formidable challenge. To start with, the difficulty of the problem may depend on the resolution 
of the image. 





Fig. 2.6.1: Images of varying resolutions (10 x 10, 20 x 20, 40 x 40, 80 x 80, and 160 x 160 pixels). 


As shown in Fig. 2.6.1, while it is easy for humans to recognize cats and dogs at the resolution of 
160 x 160 pixels, it becomes challenging at 40 x 40 pixels and next to impossible at 10 x 10 pixels. 
In other words, our ability to tell cats and dogs apart at a large distance (and thus low resolution) 
might approach uninformed guessing. Probability gives us a formal way of reasoning about our 
level of certainty. If we are completely sure that the image depicts a cat, we say that the probability 
that the corresponding label y is “cat”, denoted P(y = “cat”) equals 1. If we had no evidence to 
suggest that y = “cat” or that y = “dog”, then we might say that the two possibilities were equally 
likely expressing this as P(y = “cat”) = P(y = “dog”) = 0.5. If we were reasonably confident, but 
not sure that the image depicted a cat, we might assign a probability 0.5 < P(y = “cat”) < 1.


Now consider the second case: given some weather monitoring data, we want to predict the probability
that it will rain in Taipei tomorrow. If it is summertime, the rain might come with probability
0.5.


In both cases, we have some value of interest. And in both cases we are uncertain about the outcome.
But there is a key difference between the two cases. In the first case, the image is in fact
either a dog or a cat, and we just do not know which. In the second case, the outcome may actu- 
ally be a random event, if you believe in such things (and most physicists do). So probability is a 
flexible language for reasoning about our level of certainty, and it can be applied effectively in a 
broad set of contexts. 


2.6.1 Basic Probability Theory 


Say that we cast a die and want to know what the chance is of seeing a 1 rather than another digit. 
If the die is fair, all the six outcomes {1,...,6} are equally likely to occur, and thus we would see 
a 1 in one out of six cases. Formally we state that 1 occurs with probability 1/6.


For a real die that we receive from a factory, we might not know those proportions and we would 
need to check whether it is tainted. The only way to investigate the die is by casting it many times 
and recording the outcomes. For each cast of the die, we will observe a value in {1,...,6}. Given 
these outcomes, we want to investigate the probability of observing each outcome. 


One natural approach for each value is to take the individual count for that value and to divide it 
by the total number of tosses. This gives us an estimate of the probability of a given event. The law
of large numbers tells us that as the number of tosses grows this estimate will draw closer and closer
to the true underlying probability. Before going into the details of what is going on here, let us try it
out. 


To start, let us import the necessary packages. 


%matplotlib inline 

from d2l import mxnet as d2l
from mxnet import np, npx 
import random 

npx.set_np() 


Next, we will want to be able to cast the die. In statistics we call this process of drawing examples 
from probability distributions sampling. The distribution that assigns probabilities to a number 
of discrete choices is called the multinomial distribution. We will give a more formal definition of 
distribution later, but at a high level, think of it as just an assignment of probabilities to events. 


To draw a single sample, we simply pass in a vector of probabilities. The output is another vector 
of the same length: its value at index i is the number of times the sampling outcome corresponds 
to i.


fair_probs = [1.0 / 6] * 6 
np.random.multinomial(1, fair_probs) 


array([0, 0, 0, 1, 0, 0], dtype=int64) 


If you run the sampler a bunch of times, you will find that you get out random values each time. 
As with estimating the fairness of a die, we often want to generate many samples from the same 
distribution. It would be unbearably slow to do this with a Python for loop, so the function we are
using supports drawing multiple samples at once, returning an array of independent samples in
any shape we might desire. 


np.random.multinomial(10, fair_probs) 


array([1, 1, 5, 1, 1, 1], dtype=int64) 


Now that we know how to sample rolls of a die, we can simulate 1000 rolls. We can then go through 
and count, after each of the 1000 rolls, how many times each number was rolled. Specifically, we 
calculate the relative frequency as the estimate of the true probability. 


counts = np.random.multinomial(1000, fair_probs).astype(np.float32) 
counts / 1000 


array([0.162, 0.149, 0.178, 0.17 , 0.166, 0.175]) 


Because we generated the data from a fair die, we know that each outcome has true probability 1/6,
roughly 0.167, so the above output estimates look good. 


We can also visualize how these probabilities converge over time towards the true probability. Let 
us conduct 500 groups of experiments where each group draws 10 samples. 


counts = np.random.multinomial(10, fair_probs, size=500)
cum_counts = counts.astype(np.float32).cumsum(axis=0)
estimates = cum_counts / cum_counts.sum(axis=1, keepdims=True)

d2l.set_figsize((6, 4.5))
for i in range(6):
    d2l.plt.plot(estimates[:, i].asnumpy(),
                 label=("P(die=" + str(i + 1) + ")"))
d2l.plt.axhline(y=0.167, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Groups of experiments')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();


Each solid curve corresponds to one of the six values of the die and gives our estimated probability
that the die turns up that value as assessed after each group of experiments. The dashed black line 
gives the true underlying probability. As we get more data by conducting more experiments, the 
6 solid curves converge towards the true probability. 


Axioms of Probability Theory 


When dealing with the rolls of a die, we call the set S = {1, 2,3, 4,5, 6} the sample space or outcome 
space, where each element is an outcome. An event is a set of outcomes from a given sample space. 
For instance, “seeing a 5” ({5}) and “seeing an odd number” ({1,3,5}) are both valid events of 
rolling a die. Note that if the outcome of a random experiment is in event A, then event A has 
occurred. That is to say, if 3 dots faced up after rolling a die, since 3 ∈ {1, 3, 5}, we can say that the
event “seeing an odd number” has occurred. 


Formally, probability can be thought of as a function that maps a set to a real value. The probability
of an event A in the given sample space S, denoted as P(A), satisfies the following properties:

* For any event A, its probability is never negative, i.e., $P(A) \geq 0$;
* The probability of the entire sample space is 1, i.e., $P(S) = 1$;
* For any countable sequence of events $A_1, A_2, \ldots$ that are mutually exclusive ($A_i \cap A_j = \emptyset$ for all
  $i \neq j$), the probability that any happens is equal to the sum of their individual probabilities,
  i.e., $P(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i)$.


These are also the axioms of probability theory, proposed by Kolmogorov in 1933. Thanks to this 
axiom system, we can avoid any philosophical dispute on randomness; instead, we can reason 
rigorously with a mathematical language. For instance, by letting event $A_1$ be the entire sample
space and $A_i = \emptyset$ for all i > 1, we can prove that $P(\emptyset) = 0$, i.e., the probability of an impossible
event is 0.


Random Variables


In our random experiment of casting a die, we introduced the notion of a random variable. A ran- 
dom variable can be pretty much any quantity and is not deterministic. It could take one value 
among a set of possibilities in a random experiment. Consider a random variable X whose value 
is in the sample space S = {1,2,3,4,5,6} of rolling a die. We can denote the event “seeing a 5” 
as {X = 5} or X = 5, and its probability as P({X = 5}) or P(X = 5). By P(X = a), we make a
distinction between the random variable X and the values (e.g., a) that X can take. However, such 
pedantry results in a cumbersome notation. For a compact notation, on one hand, we can just de- 
note P(X) as the distribution over the random variable X: the distribution tells us the probability 
that X takes any value. On the other hand, we can simply write P(a) to denote the probability that 
a random variable takes the value a. Since an event in probability theory is a set of outcomes from 
the sample space, we can specify a range of values for a random variable to take. For example, 
P(1 ≤ X ≤ 3) denotes the probability of the event {1 ≤ X ≤ 3}, which means {X = 1, 2, or 3}.
Equivalently, P(1 ≤ X ≤ 3) represents the probability that the random variable X can take a
value from {1, 2, 3}.


Note that there is a subtle difference between discrete random variables, like the sides of a die, 
and continuous ones, like the weight and the height of a person. There is little point in ask- 
ing whether two people have exactly the same height. If we take precise enough measure- 
ments you will find that no two people on the planet have the exact same height. In fact, if 
we take a fine enough measurement, you will not have the same height when you wake up and 
when you go to sleep. So there is no purpose in asking about the probability that someone is 
1.80139278291028719210196740527486202 meters tall. Given the world population of humans the 
probability is virtually 0. It makes more sense in this case to ask whether someone's height falls 
into a given interval, say between 1.79 and 1.81 meters. In these cases we quantify the likelihood 
that we see a value as a density. The height of exactly 1.80 meters has no probability, but nonzero 
density. In the interval between any two different heights we have nonzero probability. In the rest 
of this section, we consider probability in discrete space. For probability over continuous random 
variables, you may refer to Section 18.6. 


2.6.2 Dealing with Multiple Random Variables 


Very often, we will want to consider more than one random variable at a time. For instance, we 
may want to model the relationship between diseases and symptoms. Given a disease and a symp- 
tom, say “flu” and “cough”, either may or may not occur in a patient with some probability. While 
we hope that the probability of both would be close to zero, we may want to estimate these prob- 
abilities and their relationships to each other so that we may apply our inferences to effect better 
medical care. 


As a more complicated example, images contain millions of pixels, thus millions of random variables.
And in many cases images will come with a label, identifying objects in the image. We can
also think of the label as a random variable. We can even think of all the metadata as random 
variables such as location, time, aperture, focal length, ISO, focus distance, and camera type. All 
of these are random variables that occur jointly. When we deal with multiple random variables, 
there are several quantities of interest.


Joint Probability


The first is called the joint probability P(A = a, B = b). Given any values a and b, the joint probability
lets us answer, what is the probability that A = a and B = b simultaneously? Note that for
any values a and b, P(A = a, B = b) ≤ P(A = a). This has to be the case, since for A = a and
B = b to happen, A = a has to happen and B = b also has to happen (and vice versa). Thus, A = a
and B = b cannot be more likely than A = a or B = b individually.


Conditional Probability 


This brings us to an interesting ratio: $0 \leq \frac{P(A = a, B = b)}{P(A = a)} \leq 1$. We call this ratio a conditional probability
and denote it by P(B = b | A = a): it is the probability of B = b, provided that A = a has occurred.





Bayes’ theorem 


Using the definition of conditional probabilities, we can derive one of the most useful and cel- 
ebrated equations in statistics: Bayes’ theorem. It goes as follows. By construction, we have the 
multiplication rule that P(A, B) = P(B | A)P(A). By symmetry, this also holds for P(A, B) = 
P(A | B)P(B). Assume that P(B) > 0. Solving for one of the conditional variables we get 


$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}. \tag{2.6.1}$$


Note that here we use the more compact notation where P(A, B) is a joint distribution and P(A | B)
is a conditional distribution. Such distributions can be evaluated for particular values A = a, B = b.


Marginalization 


Bayes’ theorem is very useful if we want to infer one thing from the other, say cause and effect, 
but we only know the properties in the reverse direction, as we will see later in this section. One 
important operation that we need, to make this work, is marginalization. It is the operation of 
determining P(B) from P(A, B). We can see that the probability of B amounts to accounting for 
all possible choices of A and aggregating the joint probabilities over all of them: 


$$P(B) = \sum_{A} P(A, B), \tag{2.6.2}$$


which is also known as the sum rule. The probability or distribution as a result of marginalization 
is called a marginal probability or a marginal distribution. 
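
The sum rule, the definition of conditional probability, and Bayes' theorem can all be exercised on a small table. The joint distribution below is a made-up example (its values are not from the text), and exact fractions are used so that the identities hold without rounding error.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(A, B) over A in {0, 1} and B in {0, 1, 2}
joint = {
    (0, 0): F(1, 10), (0, 1): F(2, 10), (0, 2): F(1, 10),
    (1, 0): F(3, 10), (1, 1): F(1, 10), (1, 2): F(2, 10),
}


def marginal_A(a):
    # Sum rule: P(A = a) = sum over b of P(A = a, B = b)
    return sum(p for (a_, _), p in joint.items() if a_ == a)


def marginal_B(b):
    # Sum rule: P(B = b) = sum over a of P(A = a, B = b)
    return sum(p for (_, b_), p in joint.items() if b_ == b)


def conditional_A_given_B(a, b):
    # Definition: P(A = a | B = b) = P(A = a, B = b) / P(B = b)
    return joint[(a, b)] / marginal_B(b)


def bayes_A_given_B(a, b):
    # Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
    p_b_given_a = joint[(a, b)] / marginal_A(a)
    return p_b_given_a * marginal_A(a) / marginal_B(b)


print(marginal_B(0), conditional_A_given_B(1, 0), bayes_A_given_B(1, 0))
```

Both routes to P(A | B) agree, which is expected since Bayes' theorem is just the multiplication rule applied in both directions.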


Independence 


Another useful property to check for is dependence vs. independence. Two random variables A and 
B being independent means that the occurrence of one event of A does not reveal any information 
about the occurrence of an event of B. In this case P(B | A) = P(B). Statisticians typically 
express this as A ⊥ B. From Bayes’ theorem, it follows immediately that also P(A | B) = P(A).
In all the other cases we call A and B dependent. For instance, two successive rolls of a die are
independent. In contrast, the position of a light switch and the brightness in the room are not
(they are not perfectly deterministic, though, since we could always have a broken light bulb,
power failure, or a broken switch).


Since $P(A \mid B) = \frac{P(A, B)}{P(B)} = P(A)$ is equivalent to P(A, B) = P(A)P(B), two random variables are
independent if and only if their joint distribution is the product of their individual distributions.
Likewise, two random variables A and B are conditionally independent given another random variable
C if and only if P(A, B | C) = P(A | C)P(B | C). This is expressed as A ⊥ B | C.
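
The product criterion for independence is easy to test on explicit joint tables. The sketch below checks it for two rolls of a fair die and for a hypothetical switch and brightness model whose probabilities are invented for illustration; `is_independent` is not a library function.

```python
from fractions import Fraction
from itertools import product


def is_independent(joint):
    # joint maps (a, b) -> P(A = a, B = b); checks P(A, B) == P(A) P(B)
    # for every pair of values
    a_vals = {a for a, _ in joint}
    b_vals = {b for _, b in joint}
    p_a = {a: sum(joint.get((a, b), 0) for b in b_vals) for a in a_vals}
    p_b = {b: sum(joint.get((a, b), 0) for a in a_vals) for b in b_vals}
    return all(joint.get((a, b), 0) == p_a[a] * p_b[b]
               for a, b in product(a_vals, b_vals))


# Two successive rolls of a fair die
two_rolls = {(a, b): Fraction(1, 36)
             for a, b in product(range(1, 7), repeat=2)}

# Hypothetical switch/brightness model: the room is usually bright
# when the switch is on, but a broken bulb is possible
switch_light = {
    ('on', 'bright'): Fraction(45, 100), ('on', 'dark'): Fraction(5, 100),
    ('off', 'bright'): Fraction(0), ('off', 'dark'): Fraction(50, 100),
}

print(is_independent(two_rolls), is_independent(switch_light))
```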


Application 


Let us put our skills to the test. Assume that a doctor administers an HIV test to a patient. This
test is fairly accurate: it fails with only 1% probability when the patient is healthy, reporting
him as diseased. Moreover, it never fails to detect HIV if the patient actually has it. We use $D_1$ to
indicate the diagnosis (1 if positive and 0 if negative) and H to denote the HIV status (1 if positive
and 0 if negative). Table 2.6.1 lists such conditional probabilities.


Table 2.6.1: Conditional probability of $P(D_1 \mid H)$.

Conditional probability | H = 1 | H = 0
$P(D_1 = 1 \mid H)$     | 1     | 0.01
$P(D_1 = 0 \mid H)$     | 0     | 0.99


























Note that the column sums are all 1 (but the row sums are not), since the conditional probabil- 
ity needs to sum up to 1, just like the probability. Let us work out the probability of the patient 
having HIV if the test comes back positive, i.e., $P(H = 1 \mid D_1 = 1)$. Obviously this is going to
depend on how common the disease is, since it affects the number of false alarms. Assume that 
the population is quite healthy, e.g., P(H = 1) = 0.0015. To apply Bayes’ theorem, we need to 
apply marginalization and the multiplication rule to determine 

















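$$\begin{aligned}
&P(D_1 = 1) \\
=& P(D_1 = 1, H = 0) + P(D_1 = 1, H = 1) \\
=& P(D_1 = 1 \mid H = 0) P(H = 0) + P(D_1 = 1 \mid H = 1) P(H = 1) \\
=& 0.011485.
\end{aligned} \tag{2.6.3}$$
Thus, we get
$$\begin{aligned}
&P(H = 1 \mid D_1 = 1) \\
=& \frac{P(D_1 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1)} \\
=& 0.1306.
\end{aligned} \tag{2.6.4}$$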


In other words, there is only a 13.06% chance that the patient actually has HIV, despite using a 
very accurate test. As we can see, probability can be counterintuitive. 
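
The arithmetic behind these numbers is short enough to reproduce directly. The sketch below plugs the prior P(H = 1) = 0.0015 and the entries of Table 2.6.1 into marginalization and Bayes' theorem; the variable names are illustrative.

```python
p_h1 = 0.0015                     # prior P(H = 1) from the text
p_d1_given_h = {1: 1.0, 0: 0.01}  # P(D1 = 1 | H) from Table 2.6.1

# Marginalization and the multiplication rule give P(D1 = 1)
p_d1 = p_d1_given_h[0] * (1 - p_h1) + p_d1_given_h[1] * p_h1
# Bayes' theorem gives P(H = 1 | D1 = 1)
posterior = p_d1_given_h[1] * p_h1 / p_d1
print(f'P(D1=1) = {p_d1:.6f}, P(H=1 | D1=1) = {posterior:.4f}')
```

Changing `p_h1` shows how strongly the posterior depends on how common the disease is: the false alarms from the large healthy population dominate when the prior is small.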


What should a patient do upon receiving such terrifying news? Likely, the patient would ask the 
physician to administer another test to get clarity. The second test has different characteristics 
and it is not as good as the first one, as shown in Table 2.6.2.


Table 2.6.2: Conditional probability of $P(D_2 \mid H)$.

Conditional probability | H = 1 | H = 0
$P(D_2 = 1 \mid H)$     | 0.98  | 0.03
$P(D_2 = 0 \mid H)$     | 0.02  | 0.97


Unfortunately, the second test comes back positive, too. Let us work out the requisite probabilities
to invoke Bayes’ theorem by assuming the conditional independence: 








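$$
\begin{aligned}
P(D_1 = 1, D_2 = 1 \mid H = 0) &= P(D_1 = 1 \mid H = 0) P(D_2 = 1 \mid H = 0) \\
&= 0.0003,
\end{aligned}
\tag{2.6.5}
$$

$$
\begin{aligned}
P(D_1 = 1, D_2 = 1 \mid H = 1) &= P(D_1 = 1 \mid H = 1) P(D_2 = 1 \mid H = 1) \\
&= 0.98.
\end{aligned}
\tag{2.6.6}
$$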








Now we can apply marginalization and the multiplication rule: 


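$$
\begin{aligned}
P(D_1 = 1, D_2 = 1) &= P(D_1 = 1, D_2 = 1, H = 0) + P(D_1 = 1, D_2 = 1, H = 1) \\
&= P(D_1 = 1, D_2 = 1 \mid H = 0) P(H = 0) + P(D_1 = 1, D_2 = 1 \mid H = 1) P(H = 1) \\
&= 0.00176955.
\end{aligned}
\tag{2.6.7}
$$

In the end, the probability of the patient having HIV given both positive tests is

$$
\begin{aligned}
P(H = 1 \mid D_1 = 1, D_2 = 1) &= \frac{P(D_1 = 1, D_2 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1, D_2 = 1)} \\
&= 0.8307.
\end{aligned}
\tag{2.6.8}
$$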


That is, the second test allowed us to gain much higher confidence that not all is well. Despite the 
second test being considerably less accurate than the first one, it still significantly improved our 
estimate. 
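
The two-test computation can be checked the same way. The sketch below uses our own variable names and relies on the conditional independence assumption made above; the numbers come from Tables 2.6.1 and 2.6.2.

```python
# Prior and test characteristics from Tables 2.6.1 and 2.6.2.
p_h1 = 0.0015                      # P(H = 1)
p_d1 = {1: 1.0, 0: 0.01}           # P(D1 = 1 | H = h)
p_d2 = {1: 0.98, 0: 0.03}          # P(D2 = 1 | H = h)
p_h = {1: p_h1, 0: 1 - p_h1}       # P(H = h)

# Conditional independence: P(D1=1, D2=1 | H=h) = P(D1=1|H=h) P(D2=1|H=h)
p_both_given_h = {h: p_d1[h] * p_d2[h] for h in (0, 1)}

# Marginalization over H gives P(D1=1, D2=1), as in (2.6.7)
p_both = sum(p_both_given_h[h] * p_h[h] for h in (0, 1))

# Bayes' theorem, as in (2.6.8)
posterior = p_both_given_h[1] * p_h[1] / p_both
print(f'{p_both:.8f}, {posterior:.4f}')  # 0.00176955, 0.8307
```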


2.6.3 Expectation and Variance 


To summarize key characteristics of probability distributions, we need some measures. The ex- 
pectation (or average) of the random variable X is denoted as 


$$E[X] = \sum_{x} x P(X = x). \tag{2.6.9}$$


When the input of a function f(x) is a random variable drawn from the distribution P with different
values x, the expectation of f(x) is computed as


$$E_{x \sim P}[f(x)] = \sum_x f(x) P(x). \tag{2.6.10}$$


In many cases we want to measure by how much the random variable X deviates from its expec- 
tation. This can be quantified by the variance 


$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2. \tag{2.6.11}$$
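
To make (2.6.9) and (2.6.11) concrete, here is a small sketch that evaluates both forms of the variance for a fair six-sided die. The die and the helper name `expectation` are illustrative choices of ours; exact fractions are used so that the two expressions can be compared without rounding.

```python
from fractions import Fraction

# A fair six-sided die: P(X = x) = 1/6 for x = 1, ..., 6 (an illustrative choice).
dist = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(dist, f=lambda x: x):
    """E[f(X)] = sum_x f(x) P(X = x) for a discrete distribution."""
    return sum(f(x) * p for x, p in dist.items())


mean = expectation(dist)                                   # E[X]
var_def = expectation(dist, lambda x: (x - mean) ** 2)     # E[(X - E[X])^2]
var_alt = expectation(dist, lambda x: x ** 2) - mean ** 2  # E[X^2] - E[X]^2

print(mean, var_def, var_alt)  # 7/2 35/12 35/12
```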


The square root of the variance is called the standard deviation. The variance of a function of a
random variable measures by how much the function deviates from the expectation of the function,
as different values x of the random variable are sampled from its distribution:


$$\mathrm{Var}[f(x)] = E\left[(f(x) - E[f(x)])^2\right]. \tag{2.6.12}$$


Summary


• We can sample from probability distributions.


• We can analyze multiple random variables using joint distribution, conditional distribution,
Bayes’ theorem, marginalization, and independence assumptions.


• Expectation and variance offer useful measures to summarize key characteristics of probability
distributions.


Exercises 
1. We conducted m = 500 groups of experiments where each group draws n = 10 samples. 
Vary m and n. Observe and analyze the experimental results. 


2. Given two events with probability P(A) and P(B), compute upper and lower bounds on
P(A ∪ B) and P(A ∩ B). (Hint: display the situation using a Venn Diagram⁴⁵.)


3. Assume that we have a sequence of random variables, say A, B, and C, where B only depends
on A, and C only depends on B. Can you simplify the joint probability P(A, B, C)?
(Hint: this is a Markov Chain⁴⁶.)


4. In Section 2.6.2, the first test is more accurate. Why not run the first test twice rather than 
run both the first and second tests? 


Discussions⁴⁷


2.7 Documentation 


Due to constraints on the length of this book, we cannot possibly introduce every single MXNet 
function and class (and you probably would not want us to). The API documentation and addi- 
tional tutorials and examples provide plenty of documentation beyond the book. In this section 
we provide you with some guidance to exploring the MXNet API. 


2.7.1 Finding All the Functions and Classes in a Module 


In order to know which functions and classes can be called in a module, we invoke the dir func- 
tion. For instance, we can query all properties in the module for generating random numbers: 


from mxnet import np 
print(dir(np.random)) 











['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__',
'__package__', '__spec__', '_mx_nd_np', 'beta', 'chisquare', 'choice', 'exponential',
'gamma', 'gumbel', 'logistic', 'lognormal', 'multinomial', 'multivariate_normal', 'normal',
'pareto', 'power', 'rand', 'randint', 'randn', 'rayleigh', 'shuffle', 'uniform', 'weibull']





45 https://en.wikipedia.org/wiki/Venn_diagram
46 https://en.wikipedia.org/wiki/Markov_chain
47 https://discuss.d2l.ai/t/36


Generally, we can ignore functions that start and end with __ (special objects in Python) or
functions that start with a single _ (usually internal functions). Based on the remaining function or
attribute names, we might hazard a guess that this module offers various methods for generating
random numbers, including sampling from the uniform distribution (uniform), normal distribution
(normal), and multinomial distribution (multinomial).
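
The filtering rule just described is easy to apply in code. The sketch below uses Python's standard `random` module as a stand-in for `np.random` (an assumption made only so that the snippet runs without MXNet installed); the same list comprehension works on any module.

```python
import random  # standard-library stand-in for np.random (an assumption for portability)

# Keep only the public names: skip anything starting with an underscore,
# which covers both __special__ objects and _internal helpers.
public_names = [name for name in dir(random) if not name.startswith('_')]

print(len(public_names) > 0, 'uniform' in public_names)  # True True
```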


2.7.2 Finding the Usage of Specific Functions and Classes 


For more specific instructions on how to use a given function or class, we can invoke the help 
function. As an example, let us explore the usage instructions for tensors’ ones function. 


help(np.ones) 


Help on function ones in module mxnet.numpy:

ones(shape, dtype=<class 'numpy.float32'>, order='C', ctx=None)
    Return a new array of given shape and type, filled with ones.
    This function currently only supports storing multi-dimensional data
    in row-major (C-style).

    Parameters
    ----------
    shape : int or tuple of int
        The shape of the empty array.
    dtype : str or numpy.dtype, optional
        An optional value type. Default is numpy.float32. Note that this
        behavior is different from NumPy's ones function where float64
        is the default value, because float32 is considered as the default
        data type in deep learning.
    order : {'C'}, optional, default: 'C'
        How to store multi-dimensional data in memory, currently only row-major
        (C-style) is supported.
    ctx : Context, optional
        An optional device context (default is the current default context).

    Returns
    -------
    out : ndarray
        Array of ones with the given shape, dtype, and ctx.

    Examples
    --------
    >>> np.ones(5)
    array([1., 1., 1., 1., 1.])

    >>> np.ones((5,), dtype=int)
    array([1, 1, 1, 1, 1], dtype=int64)

    >>> np.ones((2, 1))
    array([[1.],
           [1.]])

    >>> s = (2,2)
    >>> np.ones(s)
    array([[1., 1.],
           [1., 1.]])


From the documentation, we can see that the ones function creates a new tensor with the specified 
shape and sets all the elements to the value of 1. Whenever possible, you should run a quick test 
to confirm your interpretation: 


np.ones(4) 


array([1., 1., 1., 1.])


In the Jupyter notebook, we can use ? to display the document in another window. For example, 
list? will create content that is almost identical to help(list), displaying it in a new browser 
window. In addition, if we use two question marks, such as list??, the Python code implementing 
the function will also be displayed. 
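
Outside of Jupyter, roughly the same information can be retrieved programmatically with Python's standard `inspect` module. The sketch below uses `json.dumps` purely as an example of a documented function; it is not related to MXNet.

```python
import inspect
import json

# json.dumps is used purely as an example of a documented function.
signature = inspect.signature(json.dumps)   # the call signature, as shown by help()
doc = inspect.getdoc(json.dumps)            # the cleaned-up docstring text

print(signature)
print(doc.splitlines()[0])
```

When the source file is available, `inspect.getsource` returns the implementing Python code, which is close in spirit to what ?? displays.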


Summary 


• The official documentation provides plenty of descriptions and examples that are beyond
this book.


• We can look up documentation for the usage of an API by calling the dir and help functions,
or ? and ?? in Jupyter notebooks.


Exercises 


1. Look up the documentation for any function or class in the deep learning framework. Can 
you also find the documentation on the official website of the framework? 


Discussions⁴⁸





48 https://discuss.d2l.ai/t/38


3 Linear Neural Networks


Before we get into the details of deep neural networks, we need to cover the basics of neural net- 
work training. In this chapter, we will cover the entire training process, including defining simple 
neural network architectures, handling data, specifying a loss function, and training the model. 
In order to make things easier to grasp, we begin with the simplest concepts. Fortunately, classic 
statistical learning techniques such as linear and softmax regression can be cast as linear neural 
networks. Starting from these classic algorithms, we will introduce you to the basics, providing 
the basis for more complex techniques in the rest of the book. 


3.1 Linear Regression 


Regression refers to a set of methods for modeling the relationship between one or more indepen- 
dent variables and a dependent variable. In the natural sciences and social sciences, the purpose 
of regression is most often to characterize the relationship between the inputs and outputs. Ma- 
chine learning, on the other hand, is most often concerned with prediction. 


Regression problems pop up whenever we want to predict a numerical value. Common exam- 
ples include predicting prices (of homes, stocks, etc.), predicting length of stay (for patients in 
the hospital), demand forecasting (for retail sales), among countless others. Not every prediction 
problem is a classic regression problem. In subsequent sections, we will introduce classification 
problems, where the goal is to predict membership among a set of categories. 


3.1.1 Basic Elements of Linear Regression 


Linear regression may be both the simplest and most popular among the standard tools for
regression. Dating back to the dawn of the 19th century, linear regression flows from a few simple
assumptions. First, we assume that the relationship between the independent variables x and the 
dependent variable y is linear, i.e., that y can be expressed as a weighted sum of the elements 
in x, given some noise on the observations. Second, we assume that any noise is well-behaved 
(following a Gaussian distribution). 


To motivate the approach, let us start with a running example. Suppose that we wish to estimate 
the prices of houses (in dollars) based on their area (in square feet) and age (in years). To actually 
develop a model for predicting house prices, we would need to get our hands on a dataset consist- 
ing of sales for which we know the sale price, area, and age for each home. In the terminology of 
machine learning, the dataset is called a training dataset or training set, and each row (here the data
corresponding to one sale) is called an example (or data point, data instance, sample). The thing we
are trying to predict (price) is called a label (or target). The independent variables (age and area)
upon which the predictions are based are called features (or covariates).


Typically, we will use n to denote the number of examples in our dataset. We index the data
examples by i, denoting each input as x^(i) = [x_1^(i), x_2^(i)]⊤ and the corresponding label as y^(i).


Linear Model 


The linearity assumption just says that the target (price) can be expressed as a weighted sum of 
the features (area and age): 


$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b. \tag{3.1.1}$$


In (3.1.1), w_area and w_age are called weights, and b is called a bias (also called an offset or intercept).
The weights determine the influence of each feature on our prediction and the bias just says what 
value the predicted price should take when all of the features take value 0. Even if we will never 
see any homes with zero area, or that are precisely zero years old, we still need the bias or else we 
will limit the expressivity of our model. Strictly speaking, (3.1.1) is an affine transformation of input 
features, which is characterized by a linear transformation of features via weighted sum, combined 
with a translation via the added bias. 


Given a dataset, our goal is to choose the weights w and the bias b such that on average, the pre- 
dictions made according to our model best fit the true prices observed in the data. Models whose 
output prediction is determined by the affine transformation of input features are linear models, 
where the affine transformation is specified by the chosen weights and bias. 


In disciplines where it is common to focus on datasets with just a few features, explicitly express- 
ing models long-form like this is common. In machine learning, we usually work with high- 
dimensional datasets, so it is more convenient to employ linear algebra notation. When our inputs 
consist of d features, we express our prediction ŷ (in general the “hat” symbol denotes estimates)
as


$$\hat{y} = w_1 x_1 + \ldots + w_d x_d + b. \tag{3.1.2}$$


Collecting all features into a vector x ∈ ℝ^d and all weights into a vector w ∈ ℝ^d, we can express
our model compactly using a dot product:


$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b. \tag{3.1.3}$$


In (3.1.3), the vector x corresponds to features of a single data example. We will often find it
convenient to refer to features of our entire dataset of n examples via the design matrix X ∈ ℝ^(n×d).
Here, X contains one row for every example and one column for every feature.


For a collection of features X, the predictions ŷ ∈ ℝ^n can be expressed via the matrix-vector
product:


$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b, \tag{3.1.4}$$


where broadcasting (see Section 2.1.3) is applied during the summation. Given features of a
training dataset X and corresponding (known) labels y, the goal of linear regression is to find the
weight vector w and the bias term b such that, given features of a new data example sampled from
the same distribution as X, the new example’s label will (in expectation) be predicted with the
lowest error.
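
To make the matrix-vector form (3.1.4) concrete, the following sketch computes predictions for a tiny design matrix. It is written in plain Python so that it runs without any array library; the function name `predict` and all numbers are made up for illustration. With an array library the body reduces to a single matrix-vector product plus a broadcast bias.

```python
def predict(X, w, b):
    """Return the predictions Xw + b for a design matrix X given as a list of rows."""
    # Each prediction is the dot product of one row with w, plus the shared bias b
    # (adding the scalar b to every entry is what broadcasting does for arrays).
    return [sum(w_j * x_j for w_j, x_j in zip(w, row)) + b for row in X]


# Hypothetical numbers: two features (area in square feet, age in years).
X = [[1000.0, 10.0],
     [1500.0, 5.0],
     [800.0, 30.0]]
w = [100.0, -500.0]   # made-up weights for area and age
b = 20000.0           # made-up bias

y_hat = predict(X, w, b)
print(y_hat)  # [115000.0, 167500.0, 85000.0]
```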


Even if we believe that the best model for predicting y given x is linear, we would not expect to
find a real-world dataset of n examples where y^(i) exactly equals w⊤x^(i) + b for all 1 ≤ i ≤ n. For
example, whatever instruments we use to observe the features X and labels y might suffer a small
amount of measurement error. Thus, even when we are confident that the underlying relationship
is linear, we will incorporate a noise term to account for such errors.


Before we can go about searching for the best parameters (or model parameters) w and b, we will need 
two more things: (i) a quality measure for some given model; and (ii) a procedure for updating the 
model to improve its quality. 


Loss Function 


Before we start thinking about how to fit data with our model, we need to determine a measure of 
fitness. The loss function quantifies the distance between the real and predicted value of the target. 
The loss will usually be a non-negative number where smaller values are better and perfect pre- 
dictions incur a loss of 0. The most popular loss function in regression problems is the squared 
error. When our prediction for an example i is ŷ^(i) and the corresponding true label is y^(i), the
squared error is given by:


$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2. \tag{3.1.5}$$


The constant 1/2 makes no real difference but will prove notationally convenient, canceling out
when we take the derivative of the loss. Since the training dataset is given to us, and thus out of 
our control, the empirical error is only a function of the model parameters. To make things more 
concrete, consider the example below where we plot a regression problem for a one-dimensional 
case as shown in Fig. 3.1.1. 





Fig. 3.1.1: Fit data with a linear model. 


Note that large differences between estimates ŷ^(i) and observations y^(i) lead to even larger
contributions to the loss, due to the quadratic dependence. To measure the quality of a model on the
entire dataset of n examples, we simply average (or equivalently, sum) the losses on the training 
set. 


$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2. \tag{3.1.6}$$


When training the model, we want to find parameters (w*, b*) that minimize the total loss across
all training examples:


$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b} L(\mathbf{w}, b). \tag{3.1.7}$$


Analytic Solution


Linear regression happens to be an unusually simple optimization problem. Unlike most other 
models that we will encounter in this book, linear regression can be solved analytically by applying 
a simple formula. To start, we can subsume the bias b into the parameter w by appending a column 
to the design matrix consisting of all ones. Then our prediction problem is to minimize ‖y − Xw‖².
There is just one critical point on the loss surface and it corresponds to the minimum of the loss
over the entire domain. Taking the derivative of the loss with respect to w and setting it equal to
zero yields the analytic (closed-form) solution:


$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}. \tag{3.1.8}$$


While simple problems like linear regression may admit analytic solutions, you should not get 
used to such good fortune. Although analytic solutions allow for nice mathematical analysis, the 
requirement of an analytic solution is so restrictive that it would exclude all of deep learning. 
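
As a minimal sketch of (3.1.8), consider the special case of a single feature plus a bias, where the matrix X⊤X is only 2 × 2 and can be inverted by hand. The function names and the synthetic data below are our own illustrative choices; a general implementation would instead call a linear algebra routine on the full design matrix.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y ~ w*x + b (one feature plus a bias).

    Appending a column of ones to the design matrix X turns the bias into an
    extra weight, so we solve the normal equations (X^T X) [w, b]^T = X^T y.
    """
    n = len(xs)
    sxx = sum(x * x for x in xs)              # entries of X^T X
    sx = sum(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))  # entries of X^T y
    sy = sum(ys)
    det = sxx * n - sx * sx   # zero when all x are identical (X^T X singular)
    w = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return w, b


def mean_squared_loss(xs, ys, w, b):
    """Average of the per-example losses 0.5 * (w*x + b - y)^2, as in (3.1.6)."""
    return sum(0.5 * (w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic, noiseless data from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = fit_line(xs, ys)
print(w, b, mean_squared_loss(xs, ys, w, b))  # 2.0 1.0 0.0
```

The comment on `det` hints at when the formula breaks down: if the columns of the design matrix are linearly dependent, X⊤X is not invertible.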


Minibatch Stochastic Gradient Descent 


Even in cases where we cannot solve the models analytically, it turns out that we can still train 
models effectively in practice. Moreover, for many tasks, those difficult-to-optimize models turn 
out to be so much better that figuring out how to train them ends up being well worth the trouble. 


The key technique for optimizing nearly any deep learning model, and which we will call upon 
throughout this book, consists of iteratively reducing the error by updating the parameters in the 
direction that incrementally lowers the loss function. This algorithm is called gradient descent. 


The most naive application of gradient descent consists of taking the derivative of the loss func- 
tion, which is an average of the losses computed on every single example in the dataset. In prac- 
tice, this can be extremely slow: we must pass over the entire dataset before making a single 
update. Thus, we will often settle for sampling a random minibatch of examples every time we 
need to compute the update, a variant called minibatch stochastic gradient descent. 


In each iteration, we first randomly sample a minibatch B consisting of a fixed number of training 
examples. We then compute the derivative (gradient) of the average loss on the minibatch with 
regard to the model parameters. Finally, we multiply the gradient by a predetermined positive
value η and subtract the resulting term from the current parameter values.


We can express the update mathematically as follows (∂ denotes the partial derivative):

$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b). \tag{3.1.9}$$


To summarize, steps of the algorithm are the following: (i) we initialize the values of the model 
parameters, typically at random; (ii) we iteratively sample random minibatches from the data, 
updating the parameters in the direction of the negative gradient. For quadratic losses and affine 
transformations, we can write this out explicitly as follows: 


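$$
\begin{aligned}
\mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \\
b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
\end{aligned}
\tag{3.1.10}
$$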

Note that w and x are vectors in (3.1.10). Here, the more elegant vector notation makes the math
much more readable than expressing things in terms of coefficients, say w_1, w_2, ..., w_d. The set
cardinality |B| represents the number of examples in each minibatch (the batch size) and η denotes
the learning rate. We emphasize that the values of the batch size and learning rate are manually
pre-specified and not typically learned through model training. These parameters that are tunable
but not updated in the training loop are called hyperparameters. Hyperparameter tuning is the
process by which hyperparameters are chosen, and typically requires that we adjust them based
on the results of the training loop as assessed on a separate validation dataset (or validation set).


After training for some predetermined number of iterations (or until some other stopping criteria
are met), we record the estimated model parameters, denoted ŵ, b̂. Note that even if our function
is truly linear and noiseless, these parameters will not be the exact minimizers of the loss because,
although the algorithm converges slowly towards the minimizers, it cannot reach them exactly in a
finite number of steps.
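
The update rule (3.1.10) can be sketched in plain Python as follows. The function name, the default hyperparameters, the zero initialization, and the synthetic data are all illustrative assumptions of ours; the loops are written out explicitly for clarity rather than speed.

```python
import random


def sgd_linear_regression(X, y, lr=0.1, batch_size=4, num_epochs=200, seed=0):
    """Minibatch SGD for linear regression with squared loss, following (3.1.10).

    The function name, defaults, and zero initialization are illustrative choices.
    """
    rng = random.Random(seed)
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    indices = list(range(len(X)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # visit the examples in a fresh random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w, grad_b = [0.0] * d, 0.0
            for i in batch:
                # Residual w^T x + b - y for example i
                err = sum(w_j * x_j for w_j, x_j in zip(w, X[i])) + b - y[i]
                grad_w = [g + x_j * err for g, x_j in zip(grad_w, X[i])]
                grad_b += err
            # Step against the minibatch-averaged gradient
            w = [w_j - lr * g / len(batch) for w_j, g in zip(w, grad_w)]
            b -= lr * grad_b / len(batch)
    return w, b


# Synthetic, noiseless data from y = 2*x1 - 3*x2 + 1 (made up for illustration).
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(64)]
y = [2 * x1 - 3 * x2 + 1 for x1, x2 in X]

w, b = sgd_linear_regression(X, y)
print([round(w_j, 3) for w_j in w], round(b, 3))
```

On this noiseless toy problem the estimates should land very close to the generating values 2, −3, and 1, though, as noted above, not exactly on them.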


Linear regression happens to be a learning problem where there is only one minimum over the 
entire domain. However, for more complicated models, like deep networks, the loss surfaces 
contain many minima. Fortunately, for reasons that are not yet fully understood, deep learning 
practitioners seldom struggle to find parameters that minimize the loss on training sets. The more 
formidable task is to find parameters that will achieve low loss on data that we have not seen 
before, a challenge called generalization. We return to these topics throughout the book. 


Making Predictions with the Learned Model 


Given the learned linear regression model ŵ⊤x + b̂, we can now estimate the price of a new house
(not contained in the training data) given its area x₁ and age x₂. Estimating targets given features
is commonly called prediction or inference. 


We will try to stick with prediction because calling this step inference, despite emerging as standard 
jargon in deep learning, is somewhat of a misnomer. In statistics, inference more often denotes 
estimating parameters based on a dataset. This misuse of terminology is a common source of 
confusion when deep learning practitioners talk to statisticians. 


3.1.2 Vectorization for Speed 


When training our models, we typically want to process whole minibatches of examples simulta- 
neously. Doing this efficiently requires that we vectorize the calculations and leverage fast linear 
algebra libraries rather than writing costly for-loops in Python. 


%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np
import time


To illustrate why this matters so much, we can consider two methods for adding vectors. To start 
we instantiate two 10000-dimensional vectors containing all ones. In one method we will loop 
over the vectors with a Python for-loop. In the other method we will rely on a single call to +. 


n = 10000 


a = np.ones(n) 
b = np.ones(n) 


Since we will benchmark the running time frequently in this book, let us define a timer.


class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of time."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated time."""
        return np.array(self.times).cumsum().tolist()


Now we can benchmark the workloads. First, we add them, one coordinate at a time, using a 
for-loop. 


c = np.zeros(n)
timer = Timer()
for i in range(n):
    c[i] = a[i] + b[i]
f'{timer.stop():.5f} sec'


'4.26972 sec'


Alternatively, we rely on the overloaded + operator to compute the elementwise sum.


timer.start()
d = a + b
f'{timer.stop():.5f} sec'


'0.00029 sec' 


You probably noticed that the second method is dramatically faster than the first. Vectorizing 
code often yields order-of-magnitude speedups. Moreover, we push more of the mathematics to 
the library and need not write as many calculations ourselves, reducing the potential for errors.


3.1.3 The Normal Distribution and Squared Loss


While you can already get your hands dirty using only the information above, in the following we 
can more formally motivate the squared loss objective via assumptions about the distribution of 
noise. 


Linear regression was invented by Gauss in 1795, who also discovered the normal distribution 
(also called the Gaussian). It turns out that the connection between the normal distribution and 
linear regression runs deeper than common parentage. To refresh your memory, the probability 
density of a normal distribution with mean μ and variance σ² (standard deviation σ) is given as


$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right). \tag{3.1.11}$$


Below we define a Python function to compute the normal distribution. 


def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)


We can now visualize the normal distributions. 


# Use numpy again for visualization
x = np.arange(-7, 7, 0.01)

# Mean and standard deviation pairs
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
         ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])


[Figure: p(x) plotted against x for three normal densities: mean 0, std 1; mean 0, std 2; mean 3, std 1.]





As we can see, changing the mean corresponds to a shift along the x-axis, and increasing the 
variance spreads the distribution out, lowering its peak. 
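
These two observations can also be checked numerically without plotting. The sketch below reimplements the density with the standard `math` module (the name `normal_pdf`, the integration range, and the step size are our own choices), compares the peak heights for two standard deviations, and approximates the area under one curve with a simple Riemann sum.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coef = 1 / math.sqrt(2 * math.pi * sigma ** 2)
    return coef * math.exp(-0.5 / sigma ** 2 * (x - mu) ** 2)


# The peak sits at x = mu and has height 1 / sqrt(2 * pi * sigma^2).
peak_std1 = normal_pdf(0.0, 0.0, 1.0)
peak_std2 = normal_pdf(0.0, 0.0, 2.0)

# Riemann-sum approximation of the area under the curve on [-10, 10].
step = 0.01
area = sum(normal_pdf(-10 + k * step, 0.0, 2.0) * step for k in range(2000))

print(f'{peak_std1:.4f} {peak_std2:.4f} {area:.4f}')  # 0.3989 0.1995 1.0000
```

Doubling the standard deviation halves the peak, while the total area stays (approximately) 1.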


One way to motivate linear regression with the mean squared error loss function (or simply
squared loss) is to formally assume that observations arise from noisy measurements, where the
noise is normally distributed as follows:


$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2). \tag{3.1.12}$$


Thus, we can now write out the likelihood of seeing a particular y for a given x via


$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right). \tag{3.1.13}$$


Now, according to the principle of maximum likelihood, the best values of parameters w and b are 
those that maximize the likelihood of the entire dataset: 


$$P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}). \tag{3.1.14}$$


Estimators chosen according to the principle of maximum likelihood are called maximum likelihood
estimators. While maximizing the product of many exponential functions might look difficult,
we can simplify things significantly, without changing the objective, by maximizing the log
of the likelihood instead. For historical reasons, optimizations are more often expressed as
minimization rather than maximization. So, without changing anything, we can minimize the negative
log-likelihood − log P(y | X). Working out the mathematics gives us:


$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2. \tag{3.1.15}$$


Now we just need one more assumption: that σ is some fixed constant. Thus we can ignore the first
term because it does not depend on w or b. Now the second term is identical to the squared error
loss introduced earlier, except for the multiplicative constant 1/σ². Fortunately, the solution does
not depend on σ. It follows that minimizing the mean squared error is equivalent to maximum
likelihood estimation of a linear model under the assumption of additive Gaussian noise.
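
The relationship in (3.1.15) can be verified numerically. In the sketch below (one-dimensional, with made-up data, a made-up σ, and our own function names), the negative log-likelihood is checked to equal a constant plus the summed squared loss divided by σ², which implies that both criteria prefer the same parameters.

```python
import math


def squared_loss_sum(xs, ys, w, b):
    """Sum over examples of 0.5 * (y - w*x - b)^2."""
    return sum(0.5 * (y - w * x - b) ** 2 for x, y in zip(xs, ys))


def negative_log_likelihood(xs, ys, w, b, sigma):
    """-log P(y | X) under y = w*x + b + eps, eps ~ N(0, sigma^2), as in (3.1.15)."""
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2)
               + (y - w * x - b) ** 2 / (2 * sigma ** 2)
               for x, y in zip(xs, ys))


# Made-up one-dimensional data and two candidate parameter settings.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
sigma = 0.5
good, bad = (2.0, 1.0), (1.0, 0.0)

# The NLL equals a constant plus the squared loss scaled by 1 / sigma^2 ...
const = len(xs) * 0.5 * math.log(2 * math.pi * sigma ** 2)
for w, b in (good, bad):
    nll = negative_log_likelihood(xs, ys, w, b, sigma)
    assert math.isclose(nll, const + squared_loss_sum(xs, ys, w, b) / sigma ** 2)

# ... so both criteria rank candidate parameters identically.
nll_ranks_good_first = (negative_log_likelihood(xs, ys, *good, sigma)
                        < negative_log_likelihood(xs, ys, *bad, sigma))
loss_ranks_good_first = squared_loss_sum(xs, ys, *good) < squared_loss_sum(xs, ys, *bad)
print(nll_ranks_good_first, loss_ranks_good_first)  # True True
```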


3.1.4 From Linear Regression to Deep Networks 
So far we only talked about linear models. While neural networks cover a much richer family of
models, we can begin thinking of the linear model as a neural network by expressing it in the
language of neural networks. To begin, let us start by rewriting things in a “layer” notation.


Neural Network Diagram 
Deep learning practitioners like to draw diagrams to visualize what is happening in their models.
In Fig. 3.1.2, we depict our linear regression model as a neural network. Note that these diagrams
highlight the connectivity pattern such as how each input is connected to the output, but not the
values taken by the weights or biases.


[Figure: network diagram with an input layer and an output layer.]





Fig. 3.1.2: Linear regression is a single-layer neural network. 


For the neural network shown in Fig. 3.1.2, the inputs are x_1, ..., x_d, so the number of inputs (or
feature dimensionality) in the input layer is d. The output of the network in Fig. 3.1.2 is o_1, so the
number of outputs in the output layer is 1. Note that the input values are all given and there is just
a single computed neuron. Focusing on where computation takes place, conventionally we do not
consider the input layer when counting layers. That is to say, the number of layers for the neural
network in Fig. 3.1.2 is 1. We can think of linear regression models as neural networks consisting
of just a single artificial neuron, or as single-layer neural networks.


Since for linear regression, every input is connected to every output (in this case there is only one 
output), we can regard this transformation (the output layer in Fig. 3.1.2) as a fully-connected layer 
or dense layer. We will talk a lot more about networks composed of such layers in the next chapter. 


Biology 


Since linear regression (invented in 1795) predates computational neuroscience, it might seem 
anachronistic to describe linear regression as a neural network. To see why linear models were a 
natural place to begin when the cyberneticists/neurophysiologists Warren McCulloch and Walter 
Pitts began to develop models of artificial neurons, consider the cartoonish picture of a biological 
neuron in Fig. 3.1.3, consisting of dendrites (input terminals), the nucleus (CPU), the axon (out- 
put wire), and the axon terminals (output terminals), enabling connections to other neurons via 
synapses. 


[Figure labels: dendrite, cell body, nucleus, myelin sheath, Schwann cell, node of Ranvier, axon terminal.]


Fig. 3.1.3: The real neuron. 


Information x_i arriving from other neurons (or environmental sensors such as the retina) is
received in the dendrites. In particular, that information is weighted by synaptic weights w_i
determining the effect of the inputs (e.g., activation or inhibition via the product x_i w_i). The weighted
inputs arriving from multiple sources are aggregated in the nucleus as a weighted sum y = Σ_i x_i w_i + b,
and this information is then sent for further processing in the axon y, typically after some nonlinear
processing via σ(y). From there it either reaches its destination (e.g., a muscle) or is fed into
another neuron via its dendrites.
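
In code, this cartoon of a neuron amounts to a weighted sum followed by a nonlinearity. The sketch below is a minimal illustration with made-up inputs and weights; the logistic sigmoid is our own choice for σ, since the text does not commit to a particular nonlinearity.

```python
import math


def neuron(x, w, b):
    """Weighted sum of the inputs plus a bias, followed by a nonlinearity.

    The logistic sigmoid is an illustrative choice of sigma; the text does not fix one.
    """
    y = sum(x_i * w_i for x_i, w_i in zip(x, w)) + b   # y = sum_i x_i w_i + b
    return 1 / (1 + math.exp(-y))                      # sigma(y)


# Made-up inputs and synaptic weights: positive weights excite, negative ones inhibit.
out = neuron(x=[1.0, 2.0, 3.0], w=[0.5, -0.25, 0.0], b=0.0)
print(out)  # 0.5
```

Here the excitatory and inhibitory contributions cancel, so the weighted sum is 0 and the sigmoid outputs 0.5. Dropping the nonlinearity recovers the linear regression model.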


Certainly, the high-level idea that many such units could be cobbled together with the right con- 
nectivity and right learning algorithm, to produce far more interesting and complex behavior than 
any one neuron alone could express owes to our study of real biological neural systems. 


At the same time, most research in deep learning today draws little direct inspiration from
neuroscience. We invoke Stuart Russell and Peter Norvig who, in their classic AI textbook Artificial
Intelligence: A Modern Approach (Russell & Norvig, 2016), pointed out that although airplanes might
have been inspired by birds, ornithology has not been the primary driver of aeronautics innovation
for some centuries. Likewise, inspiration in deep learning these days comes in equal or greater
measure from mathematics, statistics, and computer science.


Summary 
• Key ingredients in a machine learning model are training data, a loss function, an optimization
algorithm, and quite obviously, the model itself.


• Vectorizing makes everything better (mostly math) and faster (mostly code).


• Minimizing an objective function and performing maximum likelihood estimation can
mean the same thing.


• Linear regression models are neural networks, too.


Exercises 
1. Assume that we have some data x_1, ..., x_n ∈ ℝ. Our goal is to find a constant b such that
Σ_i (x_i − b)² is minimized.
1. Find an analytic solution for the optimal value of b.
2. How does this problem and its solution relate to the normal distribution?


2. Derive the analytic solution to the optimization problem for linear regression with squared 
error. To keep things simple, you can omit the bias b from the problem (we can do this in a
principled fashion by adding one column to X consisting of all ones).


1. Write out the optimization problem in matrix and vector notation (treat all the data as 
a single matrix, and all the target values as a single vector). 


2. Compute the gradient of the loss with respect to w. 


3. Find the analytic solution by setting the gradient equal to zero and solving the matrix 
equation. 


4. When might this be better than using stochastic gradient descent? When might this 
method break? 


3. Assume that the noise model governing the additive noise $\epsilon$ is the exponential distribution.
That is, $p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$.


1. Write out the negative log-likelihood of the data under the model $-\log P(\mathbf{y} \mid \mathbf{X})$.
2. Can you find a closed form solution? 


3. Suggest a stochastic gradient descent algorithm to solve this problem. What could pos- 
sibly go wrong (hint: what happens near the stationary point as we keep on updating 
the parameters)? Can you fix this? 


Discussions: https://discuss.d2l.ai/t/40


3.2 Linear Regression Implementation from Scratch


Now that you understand the key ideas behind linear regression, we can begin to work through 
a hands-on implementation in code. In this section, we will implement the entire method from 
scratch, including the data pipeline, the model, the loss function, and the minibatch stochastic 
gradient descent optimizer. While modern deep learning frameworks can automate nearly all of 
this work, implementing things from scratch is the only way to make sure that you really know 
what you are doing. Moreover, when it comes time to customize models, defining our own layers 
or loss functions, understanding how things work under the hood will prove handy. In this section,
we will rely only on tensors and auto differentiation. Afterwards, we will introduce a more concise
implementation, taking advantage of the bells and whistles of deep learning frameworks.


%matplotlib inline
import random
from mxnet import autograd, np, npx
from d2l import mxnet as d2l

npx.set_np()


3.2.1 Generating the Dataset 


To keep things simple, we will construct an artificial dataset according to a linear model with 
additive noise. Our task will be to recover this model’s parameters using the finite set of examples 
contained in our dataset. We will keep the data low-dimensional so we can visualize it easily. In 
the following code snippet, we generate a dataset containing 1000 examples, each consisting of 2 
features sampled from a standard normal distribution. Thus our synthetic dataset will be a matrix
$\mathbf{X} \in \mathbb{R}^{1000 \times 2}$.


The true parameters generating our dataset will be $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$, and our synthetic
labels will be assigned according to the following linear model with the noise term $\epsilon$:


$$\mathbf{y} = \mathbf{X}\mathbf{w} + b + \boldsymbol{\epsilon}. \quad (3.2.1)$$


You could think of $\epsilon$ as capturing potential measurement errors on the features and labels. We
will assume that the standard assumptions hold and thus that $\epsilon$ obeys a normal distribution with
mean of 0. To make our problem easy, we will set its standard deviation to 0.01. The following
code generates our synthetic dataset.


def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise."""
    X = np.random.normal(0, 1, (num_examples, len(w)))
    y = np.dot(X, w) + b
    y += np.random.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))


true_w = np.array([2, -3.4]) 
true_b = 4.2 
features, labels = synthetic_data(true_w, true_b, 1000) 


Note that each row in features consists of a 2-dimensional data example and that each row in 
labels consists of a 1-dimensional label value (a scalar). 
print('features:', features[0], '\nlabel:', labels[0])


features: [2.2122064 1.1630787] 
label: [4.662078] 


By generating a scatter plot using the second feature features[:, 1] and labels, we can clearly 
observe the linear correlation between the two. 


d2l.set_figsize()
d2l.plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1)


[Figure: scatter plot of the second feature, features[:, 1], against labels.]





3.2.2 Reading the Dataset 


Recall that training models consists of making multiple passes over the dataset, grabbing one 
minibatch of examples at a time, and using them to update our model. Since this process is so 
fundamental to training machine learning algorithms, it is worth defining a utility function to 
shuffle the dataset and access it in minibatches. 


In the following code, we define the data_iter function to demonstrate one possible implemen- 
tation of this functionality. The function takes a batch size, a matrix of features, and a vector of 
labels, yielding minibatches of the size batch_size. Each minibatch consists of a tuple of features 
and labels. 


def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = np.array(
            indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]


In general, we want to use reasonably sized minibatches to take advantage of GPU hardware, which
excels at parallelizing operations. Because each example can be fed through our models in parallel
and the gradient of the loss function for each example can also be taken in parallel, GPUs allow
us to process hundreds of examples in scarcely more time than it might take to process just a
single example.


To build some intuition, let us read and print the first small batch of data examples. The shape of 
the features in each minibatch tells us both the minibatch size and the number of input features. 
Likewise, our minibatch of labels will have a shape given by batch_size. 


batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break


[[ 0.43498218 -0.52985734]
 [ 2.0088325  -0.9185635 ]
 [-1.8785107   1.3769009 ]
 [ 0.31488907  0.03415475]
 [ 0.90336937 -0.38090217]
 [-0.02594555 -0.9746724 ]
 [ 0.7727994   0.83015364]
 [-0.31846237 -0.9492751 ]
 [ 2.196302    0.14495121]
 [ ...         ...       ]]
 [[ 6.863387 ]
 [11.329663 ]
 [-4.2252774]
 [ 4.718023 ]
 [ 7.301087 ]
 [ 7.451884 ]
 [ 2.9260201]
 [ 6.796183 ]
 [ 8.113025 ]
 [-2.3186932]]





As we run the iteration, we obtain distinct minibatches successively until the entire dataset has 
been exhausted (try this). While the iteration implemented above is good for didactic purposes, 
it is inefficient in ways that might get us in trouble on real problems. For example, it requires that 
we load all the data in memory and that we perform lots of random memory access. The built-in 
iterators implemented in a deep learning framework are considerably more efficient and they can 
deal with both data stored in files and data fed via data streams. 
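
To see the iterator exhaust the dataset, the following sketch re-implements the same idea with plain Python lists. This is an assumption made so that it runs without MXNet, and `minibatches` is a hypothetical name rather than part of this book's code. It records the size of each minibatch and checks that every example is visited exactly once.

```python
import random


def minibatches(batch_size, features, labels):
    """Yield (features, labels) minibatches in random order."""
    indices = list(range(len(features)))
    random.shuffle(indices)  # Read examples in no particular order
    for i in range(0, len(indices), batch_size):
        batch = indices[i:i + batch_size]  # Slicing clips at the end of the list
        yield [features[j] for j in batch], [labels[j] for j in batch]


# Toy data: 25 examples whose label equals the example's index
features = [[float(i), float(-i)] for i in range(25)]
labels = list(range(25))

batch_sizes = []
seen = []
for X, y in minibatches(10, features, labels):
    batch_sizes.append(len(X))
    seen.extend(y)

print(batch_sizes)              # Sizes of the successive minibatches
print(sorted(seen) == labels)   # True if every example was visited exactly once
```

With 25 examples and a batch size of 10, the generator stops after three minibatches, and the last one is simply smaller than the others.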


3.2.3 Initializing Model Parameters 


Before we can begin optimizing our model's parameters by minibatch stochastic gradient descent, 
we need to have some parameters in the first place. In the following code, we initialize weights 
by sampling random numbers from a normal distribution with mean 0 and a standard deviation 
of 0.01, and setting the bias to 0. 


w = np.random.normal(0, 0.01, (2, 1))
b = np.zeros(1)
w.attach_grad()
b.attach_grad()

After initializing our parameters, our next task is to update them until they fit our data sufficiently
well. Each update requires taking the gradient of our loss function with respect to the parameters.
Given this gradient, we can update each parameter in the direction that may reduce the loss.


Since nobody wants to compute gradients explicitly (this is tedious and error prone), we use au- 
tomatic differentiation, as introduced in Section 2.5, to compute the gradient. 
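
To get a feel for what automatic differentiation saves us, here is a minimal sketch in plain Python (no MXNet; all function names are hypothetical). It writes the gradient of the squared loss for a single example by hand and checks it against a finite-difference approximation.

```python
def loss(w, b, x, y):
    """Squared loss of a linear model on one example."""
    y_hat = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (y_hat - y) ** 2 / 2


def manual_grad(w, b, x, y):
    """Hand-derived gradient: (y_hat - y) * x for w and (y_hat - y) for b."""
    err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
    return [err * xi for xi in x], err


def numeric_grad(w, b, x, y, eps=1e-6):
    """Central finite differences, perturbing one parameter at a time."""
    grad_w = []
    for k in range(len(w)):
        w_hi = w[:k] + [w[k] + eps] + w[k + 1:]
        w_lo = w[:k] + [w[k] - eps] + w[k + 1:]
        grad_w.append((loss(w_hi, b, x, y) - loss(w_lo, b, x, y)) / (2 * eps))
    grad_b = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
    return grad_w, grad_b


w, b, x, y = [0.5, -1.0], 0.1, [2.0, 3.0], 4.0
gw, gb = manual_grad(w, b, x, y)
nw, nb = numeric_grad(w, b, x, y)
print(gw, gb)  # Hand-derived gradient
print(nw, nb)  # Numerical approximation, which should agree closely
```

Deriving and checking gradients this way is manageable for two weights and a bias, but it quickly becomes tedious for larger models, which is why we let the framework do it.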


3.2.4 Defining the Model 


Next, we must define our model, relating its inputs and parameters to its outputs. Recall that to
calculate the output of the linear model, we simply take the matrix-vector product of the input
features X and the model weights w, and add the offset b to each example. Note that below Xw is
a vector and b is a scalar. Recall the broadcasting mechanism as described in Section 2.1.3. When
we add a vector and a scalar, the scalar is added to each component of the vector.


def linreg(X, w, b):  #@save
    """The linear regression model."""
    return np.dot(X, w) + b
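
The broadcasting step can also be spelled out by hand. The sketch below uses plain Python lists rather than MXNet arrays (an assumption made for portability) and a hypothetical helper `linreg_lists`. The scalar `b` is added to every entry of the vector `Xw`, which is what broadcasting does implicitly.

```python
def linreg_lists(X, w, b):
    """Compute Xw + b for a list-of-rows X, a weight list w, and a scalar b."""
    Xw = [sum(x_j * w_j for x_j, w_j in zip(row, w)) for row in X]
    return [o + b for o in Xw]  # Add the scalar to each component


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [2.0, -1.0]
b = 0.5
print(linreg_lists(X, w, b))  # [0.5, 2.5, 4.5]
```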


3.2.5 Defining the Loss Function 


Since updating our model requires taking the gradient of our loss function, we ought to define the
loss function first. Here we will use the squared loss function as described in Section 3.1. In the
implementation, we need to reshape the true value y to the shape of the predicted value y_hat. The
result returned by the following function will also have the same shape as y_hat.


def squared_loss(y_hat, y):  #@save
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape))**2 / 2


3.2.6 Defining the Optimization Algorithm 


As we discussed in Section 3.1, linear regression has a closed-form solution. However, this is not
a book about linear regression: it is a book about deep learning. Since none of the other models
that this book introduces can be solved analytically, we will take this opportunity to introduce your
first working example of minibatch stochastic gradient descent.
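
For reference, the closed-form solution is easy to write down in the single-feature case, where the least-squares slope is the covariance of the feature and the label divided by the variance of the feature. The sketch below is an illustration in plain Python with made-up, noise-free data; `fit_line` is a hypothetical helper and not code from this book.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ w * x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    w = cov / var
    b = y_mean - w * x_mean
    return w, b


# Noise-free data generated from y = 2x + 4.2
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 4.2 for x in xs]
w, b = fit_line(xs, ys)
print(w, b)  # Should be very close to 2 and 4.2
```

No iteration is needed here, but this shortcut is specific to linear models with squared error, which is why the rest of the section uses gradient descent instead.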


At each step, using one minibatch randomly drawn from our dataset, we will estimate the gradient
of the loss with respect to our parameters. Next, we will update our parameters in the direction
that may reduce the loss. The following code applies the minibatch stochastic gradient descent
update, given a set of parameters, a learning rate, and a batch size. The size of the update step is
determined by the learning rate lr. Because our loss is calculated as a sum over the minibatch of
examples, we normalize our step size by the batch size (batch_size), so that the magnitude of a
typical step size does not depend heavily on our choice of the batch size.


def sgd(params, lr, batch_size):  #@save
    """Minibatch stochastic gradient descent."""
    for param in params:
        param[:] = param - lr * param.grad / batch_size

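
The role of the division by `batch_size` can be checked with a small sketch. It assumes plain Python lists in place of MXNet arrays and a hypothetical `sgd_step` helper. Because the gradient is accumulated as a sum over the minibatch, dividing by the batch size yields the same step as using the average gradient.

```python
def sgd_step(params, grads, lr, batch_size):
    """Return updated parameters given gradients summed over a minibatch."""
    return [p - lr * g / batch_size for p, g in zip(params, grads)]


# Per-example gradients of a single parameter for a minibatch of 4 examples
per_example = [0.4, -0.2, 1.0, 0.8]
summed = sum(per_example)
averaged = summed / len(per_example)

step_from_sum = sgd_step([1.0], [summed], lr=0.1, batch_size=4)
step_from_mean = sgd_step([1.0], [averaged], lr=0.1, batch_size=1)
print(step_from_sum, step_from_mean)  # The two updates should coincide
```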
3.2.7 Training


Now that we have all of the parts in place, we are ready to implement the main training loop. It 
is crucial that you understand this code because you will see nearly identical training loops over 
and over again throughout your career in deep learning. 


In each iteration, we will grab a minibatch of training examples, and pass them through our model
to obtain a set of predictions. After calculating the loss, we initiate the backward pass through
the network, storing the gradients with respect to each parameter. Finally, we will call the
optimization algorithm sgd to update the model parameters.


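In summary, we will execute the following loop:
* Initialize parameters $(\mathbf{w}, b)$
* Repeat until done
    - Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
    - Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$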
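
The loop above can be sketched without any framework. The following version is an illustration only: it uses plain Python lists and the standard `random` module instead of MXNet, writes the gradient of the squared loss by hand, and fixes a random seed for repeatability. The hyperparameter values mirror the ones used in this section.

```python
import random

random.seed(0)
true_w, true_b = [2.0, -3.4], 4.2

# Synthetic data: y = Xw + b + Gaussian noise with standard deviation 0.01
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(1000)]
y = [sum(wj * xj for wj, xj in zip(true_w, x)) + true_b + random.gauss(0, 0.01)
     for x in X]

# Initialize parameters (w, b)
w, b = [random.gauss(0, 0.01) for _ in range(2)], 0.0
lr, batch_size, num_epochs = 0.03, 10, 3

for epoch in range(num_epochs):
    order = list(range(len(X)))
    random.shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        # Gradient g of the average squared loss over the minibatch
        g_w, g_b = [0.0, 0.0], 0.0
        for i in batch:
            err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
            g_w = [g + err * xj / len(batch) for g, xj in zip(g_w, X[i])]
            g_b += err / len(batch)
        # Update parameters: (w, b) <- (w, b) - lr * g
        w = [wj - lr * g for wj, g in zip(w, g_w)]
        b -= lr * g_b

print(w, b)  # Should end up close to true_w and true_b
```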


In each epoch, we will iterate through the entire dataset (using the data_iter function), passing
once through every example in the training dataset (assuming that the number of examples is
divisible by the batch size). The number of epochs num_epochs and the learning rate lr are both
hyperparameters, which we set here to 3 and 0.03, respectively. Unfortunately, setting
hyperparameters is tricky and requires some adjustment by trial and error. We elide these details
for now but revisit them later in Chapter 11.


lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
        # Because `l` has a shape (`batch_size`, 1) and is not a scalar
        # variable, the elements in `l` are added together to obtain a new
        # variable, on which gradients with respect to [`w`, `b`] are computed
        l.backward()
        sgd([w, b], lr, batch_size)  # Update parameters using their gradient
    train_l = loss(net(features, w, b), labels)
    print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')


epoch 1, loss 0.024890 
epoch 2, loss 0.000089 
epoch 3, loss 0.000051 


In this case, because we synthesized the dataset ourselves, we know precisely what the true pa- 
rameters are. Thus, we can evaluate our success in training by comparing the true parameters 
with those that we learned through our training loop. Indeed they turn out to be very close to 
each other. 
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')


error in estimating w: [ 0.00055313 -0.00041389] 
error in estimating b: [0.00010967] 


Note that we should not take it for granted that we are able to recover the parameters perfectly. 
However, in machine learning, we are typically less concerned with recovering true underlying 
parameters, and more concerned with parameters that lead to highly accurate prediction. For- 
tunately, even on difficult optimization problems, stochastic gradient descent can often find re- 
markably good solutions, owing partly to the fact that, for deep networks, there exist many con- 
figurations of the parameters that lead to highly accurate prediction. 


Summary 
* We saw how a deep network can be implemented and optimized from scratch, using just
tensors and auto differentiation, without any need for defining layers or fancy optimizers.
* This section only scratches the surface of what is possible. In the following sections, we will
describe additional models based on the concepts that we have just introduced and learn
how to implement them more concisely.


Exercises 
1. What would happen if we were to initialize the weights to zero? Would the algorithm still
work?


2. Assume that you are Georg Simon Ohm trying to come up with a model between voltage
and current. Can you use auto differentiation to learn the parameters of your model?


3. Can you use Planck's Law to determine the temperature of an object using spectral energy
density?


4. What are the problems you might encounter if you wanted to compute the second deriva- 
tives? How would you fix them? 


5. Why is the reshape function needed in the squared_loss function? 
6. Experiment using different learning rates to find out how fast the loss function value drops. 


7. If the number of examples cannot be divided by the batch size, what happens to the 
data_iter function’s behavior? 


Discussions: https://discuss.d2l.ai/t/42

Georg Simon Ohm: https://en.wikipedia.org/wiki/Georg_Ohm
Planck's Law: https://en.wikipedia.org/wiki/Planck%27s_law


3.3 Concise Implementation of Linear Regression


Broad and intense interest in deep learning for the past several years has inspired companies, 
academics, and hobbyists to develop a variety of mature open source frameworks for automating 
the repetitive work of implementing gradient-based learning algorithms. In Section 3.2, we relied 
only on (i) tensors for data storage and linear algebra; and (ii) auto differentiation for calculat- 
ing gradients. In practice, because data iterators, loss functions, optimizers, and neural network 
layers are so common, modern libraries implement these components for us as well. 


In this section, we will show you how to implement the linear regression model from Section 3.2 
concisely by using high-level APIs of deep learning frameworks. 


3.3.1 Generating the Dataset 


To start, we will generate the same dataset as in Section 3.2. 


from mxnet import autograd, gluon, np, npx
from d2l import mxnet as d2l

npx.set_np()

true_w = np.array([2, -3.4])
true_b = 4.2
features, labels = d2l.synthetic_data(true_w, true_b, 1000)


3.3.2 Reading the Dataset 


Rather than rolling our own iterator, we can call upon the existing API in a framework to read
data. We pass in features and labels as arguments and specify batch_size when instantiating
a data iterator object. In addition, the Boolean value is_train indicates whether or not we want
the data iterator object to shuffle the data on each epoch (pass through the dataset).


def load_array(data_arrays, batch_size, is_train=True):  #@save
    """Construct a Gluon data iterator."""
    dataset = gluon.data.ArrayDataset(*data_arrays)
    return gluon.data.DataLoader(dataset, batch_size, shuffle=is_train)


batch_size = 10 
data_iter = load_array((features, labels), batch_size) 


Now we can use data_iter in much the same way as we called the data_iter function in Section 
3.2. To verify that it is working, we can read and print the first minibatch of examples. Comparing 
with Section 3.2, here we use iter to construct a Python iterator and use next to obtain the first 
item from the iterator. 


next(iter(data_iter))


[array([[-0.9397334 ,  1.4642214 ],
        [ ...       ,  ...       ],
        [-0.99196637,  0.7509554 ],
        [ 0.03087033, -1.2529644 ],
        [ 0.3472356 , -0.32522526],
        [-0.08807065,  0.69216484],
        [-0.34623215,  0.2672201 ],
        [ 1.7303463 , -0.4695727 ],
        [ 0.05918283,  1.1066241 ],
        [-0.44527945, -0.91978157]]),
 array([[-2.6656873 ],
        [ 3.5531938 ],
        [-0.33856755],
        [ ...       ],
        [ 6.0134516 ],
        [ 1.6832367 ],
        [ 2.5925815 ],
        [ 9.267116  ],
        [ 0.5607804 ],
        [ 6.4434476 ]])]
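
The `iter`/`next` idiom works for any Python iterable, not only for Gluon data loaders. The following framework-free sketch uses a hypothetical `BatchLoader` class as a stand-in for a data loader. It makes no claim about how Gluon's `DataLoader` is implemented.

```python
class BatchLoader:
    """A toy iterable that yields consecutive slices of a list."""

    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size

    def __iter__(self):
        # Each call returns a fresh generator that starts from the beginning
        for i in range(0, len(self.data), self.batch_size):
            yield self.data[i:i + self.batch_size]


loader = BatchLoader(list(range(7)), batch_size=3)
first = next(iter(loader))  # Fresh iterator, first minibatch only
print(first)                # [0, 1, 2]
print(list(loader))         # [[0, 1, 2], [3, 4, 5], [6]]
```

Because `iter` builds a new iterator on every call, `next(iter(loader))` always returns the first minibatch and does not consume the loader for later loops.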





3.3.3 Defining the Model 


When we implemented linear regression from scratch in Section 3.2, we defined our model pa- 
rameters explicitly and coded up the calculations to produce output using basic linear algebra 
operations. You should know how to do this. But once your models get more complex, and once 
you have to do this nearly every day, you will be glad for the assistance. The situation is similar 
to coding up your own blog from scratch. Doing it once or twice is rewarding and instructive, but 
you would be a lousy web developer if every time you needed a blog you spent a month reinventing 
the wheel. 


For standard operations, we can use a framework's predefined layers, which allow us to focus espe- 
cially on the layers used to construct the model rather than having to focus on the implementation. 
We will first define a model variable net, which will refer to an instance of the Sequential class. 
The Sequential class defines a container for several layers that will be chained together. Given 
input data, a Sequential instance passes it through the first layer, in turn passing the output as 
the second layer's input and so forth. In the following example, our model consists of only one 
layer, so we do not really need Sequential. But since nearly all of our future models will involve 
multiple layers, we will use it anyway just to familiarize you with the most standard workflow. 
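
The chaining behavior is simple enough to sketch in a few lines of plain Python. `TinySequential` below is a hypothetical toy container written for illustration; it is not Gluon's `Sequential` and omits everything related to parameters and initialization.

```python
class TinySequential:
    """Chain callables: the output of one layer is the input of the next."""

    def __init__(self):
        self.layers = []

    def add(self, layer):
        self.layers.append(layer)

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


net = TinySequential()
net.add(lambda x: 2 * x)   # First "layer"
net.add(lambda x: x + 1)   # Second "layer"
print(net(3))              # 7, since (2 * 3) + 1 = 7
```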


Recall the architecture of a single-layer network as shown in Fig. 3.1.2. The layer is said to be fully- 
connected because each of its inputs is connected to each of its outputs by means of a matrix-vector 
multiplication. 


In Gluon, the fully-connected layer is defined in the Dense class. Since we only want to generate a 
single scalar output, we set that number to 1. 


It is worth noting that, for convenience, Gluon does not require us to specify the input shape for
each layer. So here, we do not need to tell Gluon how many inputs go into this linear layer. When
we first try to pass data through our model, e.g., when we execute net(X) later, Gluon will
automatically infer the number of inputs to each layer. We will describe how this works in more
detail later.
# `nn` is an abbreviation for neural networks
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(1))


3.3.4 Initializing Model Parameters 


Before using net, we need to initialize the model parameters, such as the weights and bias in the 
linear regression model. Deep learning frameworks often have a predefined way to initialize the 
parameters. Here we specify that each weight parameter should be randomly sampled from a nor- 
mal distribution with mean 0 and standard deviation 0.01. The bias parameter will be initialized 
to zero. 


We will import the initializer module from MXNet. This module provides various methods
for model parameter initialization. Gluon makes init available as a shortcut (abbreviation) to
access the initializer package. We only specify how to initialize the weight by calling
init.Normal(sigma=0.01). Bias parameters are initialized to zero by default.


from mxnet import init 
net.initialize(init.Normal(sigma=0.01)) 


The code above may look straightforward but you should note that something strange is happening
here. We are initializing parameters for a network even though Gluon does not yet know how
many dimensions the input will have! It might be 2 as in our example or it might be 2000. Gluon
lets us get away with this because behind the scenes, the initialization is actually deferred. The
real initialization will take place only when we first attempt to pass data through the
network. Just be careful to remember that since the parameters have not been initialized yet, we
cannot access or manipulate them.
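
The idea of deferred initialization can be imitated in plain Python. The class `LazyDense` below is a hypothetical toy written for this illustration. It shows the general pattern of creating weights on the first forward pass and makes no claim about how Gluon implements it internally.

```python
import random


class LazyDense:
    """Toy dense layer whose weights are created on the first forward pass."""

    def __init__(self, units, sigma=0.01):
        self.units = units
        self.sigma = sigma
        self.weight = None  # Shape unknown until data arrives
        self.bias = None

    def __call__(self, x):
        if self.weight is None:
            # Infer the number of inputs from the first example
            self.weight = [[random.gauss(0, self.sigma) for _ in x]
                           for _ in range(self.units)]
            self.bias = [0.0] * self.units
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


layer = LazyDense(1)
print(layer.weight)                             # None: nothing to inspect yet
out = layer([0.5, -1.2])                        # First call triggers initialization
print(len(layer.weight), len(layer.weight[0]))  # 1 2
```

Before the first call there is no weight matrix to read or modify, which is the same restriction described above.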


3.3.5 Defining the Loss Function 


In Gluon, the loss module defines various loss functions. In this example, we will use the Gluon 
implementation of squared loss (L2Loss). 


loss = gluon.loss.L2Loss() 


3.3.6 Defining the Optimization Algorithm 


Minibatch stochastic gradient descent is a standard tool for optimizing neural networks and thus 
Gluon supports it alongside a number of variations on this algorithm through its Trainer class. 
When we instantiate Trainer, we will specify the parameters to optimize over (obtainable from 
our model net via net.collect_params()), the optimization algorithm we wish to use (sgd), and 
a dictionary of hyperparameters required by our optimization algorithm. Minibatch stochastic 
gradient descent just requires that we set the value learning_rate, which is set to 0.03 here. 


from mxnet import gluon

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.03})
3.3.7 Training


You might have noticed that expressing our model through high-level APIs of a deep learning 
framework requires comparatively few lines of code. We did not have to individually allocate 
parameters, define our loss function, or implement minibatch stochastic gradient descent. Once 
we start working with much more complex models, advantages of high-level APIs will grow con- 
siderably. However, once we have all the basic pieces in place, the training loop itself is strikingly 
similar to what we did when implementing everything from scratch. 


To refresh your memory: for some number of epochs, we will make a complete pass over the
dataset (data_iter), iteratively grabbing one minibatch of inputs and the corresponding
ground-truth labels. For each minibatch, we go through the following ritual:


* Generate predictions by calling net(X) and calculate the loss l (the forward propagation).
* Calculate gradients by running the backpropagation.
* Update the model parameters by invoking our optimizer.


For good measure, we compute the loss after each epoch and print it to monitor progress. 


num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    l = loss(net(features), labels)
    print(f'epoch {epoch + 1}, loss {l.mean().asnumpy():f}')


epoch 1, loss 0.025045 
epoch 2, loss 0.000088 
epoch 3, loss 0.000051 


Below, we compare the model parameters learned by training on finite data and the actual param- 
eters that generated our dataset. To access parameters, we first access the layer that we need from 
net and then access that layer’s weights and bias. As in our from-scratch implementation, note 
that our estimated parameters are close to their ground-truth counterparts. 


w = net[0].weight.data()
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
b = net[0].bias.data()
print(f'error in estimating b: {true_b - b}')


error in estimating w: [ 6.3693523e-04 -5.9366226e-05] 
error in estimating b: [0.00053215] 
Summary


* Using Gluon, we can implement models much more concisely.
* In Gluon, the data module provides tools for data processing, the nn module defines a large
number of neural network layers, and the loss module defines many common loss functions.
* MXNet's module initializer provides various methods for model parameter initialization.
* Dimensionality and storage are automatically inferred, but be careful not to attempt to
access parameters before they have been initialized.


Exercises 


1. If we replace l = loss(output, y) with l = loss(output, y).mean(), we need to change
trainer.step(batch_size) to trainer.step(1) for the code to behave identically. Why?


2. Review the MXNet documentation to see what loss functions and initialization methods are 
provided in the modules gluon.loss and init. Replace the loss by Huber’s loss. 
3. How do you access the gradient of dense.weight? 


Discussions


3.4 Softmax Regression 


In Section 3.1, we introduced linear regression, working through implementations from scratch 
in Section 3.2 and again using high-level APIs of a deep learning framework in Section 3.3 to do 
the heavy lifting. 


Regression is the hammer we reach for when we want to answer how much? or how many? ques- 
tions. If you want to predict the number of dollars (price) at which a house will be sold, or the 
number of wins a baseball team might have, or the number of days that a patient will remain 
hospitalized before being discharged, then you are probably looking for a regression model. 


In practice, we are more often interested in classification: asking not “how much” but “which one”:
* Does this email belong in the spam folder or the inbox?
* Is this customer more likely to sign up or not to sign up for a subscription service?
* Does this image depict a donkey, a dog, a cat, or a rooster?
* Which movie is Aston most likely to watch next?


Colloquially, machine learning practitioners overload the word classification to describe two subtly 
different problems: (i) those where we are interested only in hard assignments of examples to 
categories (classes); and (ii) those where we wish to make soft assignments, i.e., to assess the 
probability that each category applies. The distinction tends to get blurred, in part, because often, 
even when we only care about hard assignments, we still use models that make soft assignments. 





Discussions for Section 3.3: https://discuss.d2l.ai/t/44
3.4.1 Classification Problem


To get our feet wet, let us start off with a simple image classification problem. Here, each input
consists of a $2 \times 2$ grayscale image. We can represent each pixel value with a single scalar, giving
us four features $x_1, x_2, x_3, x_4$. Further, let us assume that each image belongs to one among the
categories “cat”, “chicken”, and “dog”.


Next, we have to choose how to represent the labels. We have two obvious choices. Perhaps
the most natural impulse would be to choose $y \in \{1, 2, 3\}$, where the integers represent
$\{\text{dog}, \text{cat}, \text{chicken}\}$ respectively. This is a great way of storing such information on a
computer. If the categories had some natural ordering among them, say if we were trying to predict
$\{\text{baby}, \text{toddler}, \text{adolescent}, \text{young adult}, \text{adult}, \text{geriatric}\}$, then it might even make sense to cast
this problem as regression and keep the labels in this format.


But general classification problems do not come with natural orderings among the classes.
Fortunately, statisticians long ago invented a simple way to represent categorical data: the one-hot
encoding. A one-hot encoding is a vector with as many components as we have categories. The
component corresponding to a particular instance's category is set to 1 and all other components
are set to 0. In our case, a label $y$ would be a three-dimensional vector, with $(1, 0, 0)$ corresponding
to “cat”, $(0, 1, 0)$ to “chicken”, and $(0, 0, 1)$ to “dog”:


$$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}. \quad (3.4.1)$$
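
As a small illustration, the encoding above can be produced with a helper like the following. The function name `one_hot` and the list-based representation are assumptions made for this sketch, written in plain Python.

```python
def one_hot(label, classes):
    """Return a tuple with a 1 at the position of `label` and 0 elsewhere."""
    return tuple(1 if c == label else 0 for c in classes)


classes = ["cat", "chicken", "dog"]
for c in classes:
    print(c, one_hot(c, classes))
```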


3.4.2 Network Architecture 


In order to estimate the conditional probabilities associated with all the possible classes, we need
a model with multiple outputs, one per class. To address classification with linear models, we will
need as many affine functions as we have outputs. Each output will correspond to its own affine
function. In our case, since we have 4 features and 3 possible output categories, we will need 12
scalars to represent the weights ($w$ with subscripts), and 3 scalars to represent the biases ($b$ with
subscripts). We compute these three logits, $o_1$, $o_2$, and $o_3$, for each input:


$$\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
\end{aligned} \quad (3.4.2)$$


We can depict this calculation with the neural network diagram shown in Fig. 3.4.1. Just as in
linear regression, softmax regression is also a single-layer neural network. And since the calculation
of each output, $o_1$, $o_2$, and $o_3$, depends on all inputs, $x_1$, $x_2$, $x_3$, and $x_4$, the output layer of softmax
regression can also be described as a fully-connected layer.


Fig. 3.4.1: Softmax regression is a single-layer neural network.


To express the model more compactly, we can use linear algebra notation. In vector form, we
arrive at $\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}$, a form better suited both for mathematics, and for writing code. Note that
we have gathered all of our weights into a $3 \times 4$ matrix and that for features of a given data example
$\mathbf{x}$, our outputs are given by a matrix-vector product of our weights by our input features plus our
biases $\mathbf{b}$.
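
The vector form translates directly into code. The sketch below uses plain Python lists and made-up numbers for the weights, biases, and pixel values; `logits` is a hypothetical helper name. Each row of `W` holds the weights of one output.

```python
def logits(W, x, b):
    """Compute o = Wx + b for a weight matrix W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


# Made-up numbers: 3 classes (rows) and 4 pixel features (columns)
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0],
     [0.5, 0.5, 0.5, 0.5]]
b = [0.0, -1.0, 2.0]
x = [1.0, 2.0, 3.0, 4.0]
print(logits(W, x, b))  # [1.0, 4.0, 7.0]
```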


3.4.3 Parameterization Cost of Fully-Connected Layers 


As we will see in subsequent chapters, fully-connected layers are ubiquitous in deep learning.
However, as the name suggests, fully-connected layers are fully connected with potentially many
learnable parameters. Specifically, for any fully-connected layer with $d$ inputs and $q$ outputs, the
parameterization cost is $\mathcal{O}(dq)$, which can be prohibitively high in practice. Fortunately, this cost
of transforming $d$ inputs into $q$ outputs can be reduced to $\mathcal{O}(\frac{dq}{n})$, where the hyperparameter $n$
can be flexibly specified by us to balance between parameter saving and model effectiveness in
real-world applications (Zhang et al., 2021).
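
To make the $\mathcal{O}(dq)$ cost concrete, the following sketch counts the parameters of an ordinary fully-connected layer, including its biases. The helper name and the two larger layer sizes are hypothetical and chosen only for illustration.

```python
def dense_param_count(d, q):
    """Weights (d * q) plus biases (q) of a fully-connected layer."""
    return d * q + q


print(dense_param_count(4, 3))        # 15: the 12 weights and 3 biases of Section 3.4.2
print(dense_param_count(784, 10))     # 7850
print(dense_param_count(4096, 4096))  # 16781312
```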


3.4.4 Softmax Operation 


The main approach that we are going to take here is to interpret the outputs of our model as
probabilities. We will optimize our parameters to produce probabilities that maximize the likelihood
of the observed data. Then, to generate predictions, we will set a threshold, for example, choosing
the label with the maximum predicted probability.


Put formally, we would like any output $\hat{y}_j$ to be interpreted as the probability that a given item
belongs to class $j$. Then we can choose the class with the largest output value as our prediction
$\operatorname*{argmax}_j \hat{y}_j$. For example, if $\hat{y}_1$, $\hat{y}_2$, and $\hat{y}_3$ are 0.1, 0.8, and 0.1, respectively, then we predict
category 2, which (in our example) represents “chicken”.


You might be tempted to suggest that we interpret the logits $\mathbf{o}$ directly as our outputs of interest.
However, there are some problems with directly interpreting the output of the linear layer as a
probability. On one hand, nothing constrains these numbers to sum to 1. On the other hand,
depending on the inputs, they can take negative values. These violate basic axioms of probability
presented in Section 2.6.


To interpret our outputs as probabilities, we must guarantee that (even on new data), they will be
nonnegative and sum up to 1. Moreover, we need a training objective that encourages the model
to estimate probabilities faithfully. Of all instances when a classifier outputs 0.5, we hope that half
of those examples will actually belong to the predicted class. This is a property called calibration.


The softmax function, invented in 1959 by the social scientist R. Duncan Luce in the context of 
choice models, does precisely this. To transform our logits such that they become nonnegative and 
sum to 1, while requiring that the model remains differentiable, we first exponentiate each logit 
(ensuring non-negativity) and then divide by their sum (ensuring that they sum to 1): 


$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \text{where} \quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}. \qquad (3.4.3)$$


It is easy to see $\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1$ with $0 \leq \hat{y}_j \leq 1$ for all j. Thus, $\hat{\mathbf{y}}$ is a proper probability distribution
whose element values can be interpreted accordingly. Note that the softmax operation does not
change the ordering among the logits o, which are simply the pre-softmax values that determine
the probabilities assigned to each class. Therefore, during prediction we can still pick out the most
likely class by


$$\operatorname*{argmax}_j \hat{y}_j = \operatorname*{argmax}_j o_j. \qquad (3.4.4)$$


Although softmax is a nonlinear function, the outputs of softmax regression are still determined
by an affine transformation of input features; thus, softmax regression is a linear model. 
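
To make the properties above concrete, here is a minimal sketch in plain Python. The logit values are made up for illustration, and `softmax` is a small helper written for this example, not the implementation used later in the chapter. It checks that the outputs are positive, sum to 1, and keep the same argmax as the logits.

```python
import math

def softmax(logits):
    """Map a list of real-valued logits to probabilities."""
    exps = [math.exp(o) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: one is negative and they do not sum to 1,
# so they cannot be read as probabilities directly.
o = [1.0, -2.0, 3.0]
y_hat = softmax(o)

# Index of the largest entry before and after the softmax
argmax_o = max(range(len(o)), key=lambda j: o[j])
argmax_y = max(range(len(y_hat)), key=lambda j: y_hat[j])

print(y_hat)               # three positive numbers
print(sum(y_hat))          # 1 up to rounding error
print(argmax_o, argmax_y)  # the same index twice
```

Because the exponential is monotonically increasing and the normalizer is shared by all classes, the ordering of the entries cannot change.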


3.4.5 Vectorization for Minibatches 


To improve computational efficiency and take advantage of GPUs, we typically carry out vector
calculations for minibatches of data. Assume that we are given a minibatch X of examples with
feature dimensionality (number of inputs) d and batch size n. Moreover, assume that we have q
categories in the output. Then the minibatch features X are in $\mathbb{R}^{n \times d}$, the weights satisfy $\mathbf{W} \in \mathbb{R}^{d \times q}$, and the
bias satisfies $\mathbf{b} \in \mathbb{R}^{1 \times q}$.


$$\begin{aligned} \mathbf{O} &= \mathbf{X}\mathbf{W} + \mathbf{b}, \\ \hat{\mathbf{Y}} &= \mathrm{softmax}(\mathbf{O}). \end{aligned} \qquad (3.4.5)$$

This accelerates the dominant operation into a matrix-matrix product XW vs. the matrix-vector 
products we would be executing if we processed one example at a time. Since each row in X rep- 
resents a data example, the softmax operation itself can be computed rowwise: for each row of 
O, exponentiate all entries and then normalize them by the sum. Triggering broadcasting during 
the summation XW + b in (3.4.5), both the minibatch logits O and output probabilities Y are n x q 
matrices. 
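
The shapes in (3.4.5) can be checked with a short sketch. It uses standard NumPy rather than the MXNet `np` module used elsewhere in this book, and the sizes `n`, `d`, `q` and the random data are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

n, d, q = 4, 5, 3                      # batch size, inputs, classes (made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))            # minibatch features, n x d
W = rng.normal(size=(d, q))            # weights, d x q
b = np.zeros((1, q))                   # bias, 1 x q

O = X @ W + b                          # b is broadcast across the n rows
O_exp = np.exp(O)
Y_hat = O_exp / O_exp.sum(axis=1, keepdims=True)   # rowwise softmax

print(O.shape, Y_hat.shape)            # both are n x q
print(Y_hat.sum(axis=1))               # every row sums to 1 up to rounding
```

Keeping the summed axis with `keepdims=True` yields an n x 1 column of normalizers, so the division broadcasts across the q columns of each row.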


3.4.6 Loss Function 


Next, we need a loss function to measure the quality of our predicted probabilities. We will rely 
on maximum likelihood estimation, the very same concept that we encountered when providing a
probabilistic justification for the mean squared error objective in linear regression (Section 3.1.3). 


Log-Likelihood 


The softmax function gives us a vector $\hat{\mathbf{y}}$, which we can interpret as estimated conditional probabilities
of each class given any input x, e.g., $\hat{y}_1 = P(y = \text{cat} \mid \mathbf{x})$. Suppose that the entire dataset
{X, Y} has n examples, where the example indexed by i consists of a feature vector $\mathbf{x}^{(i)}$ and a one-hot
label vector $\mathbf{y}^{(i)}$. We can compare the estimates with reality by checking how probable the
actual classes are according to our model, given the features:


$$P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}). \qquad (3.4.6)$$


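According to maximum likelihood estimation, we maximize P(Y | X), which is equivalent to minimizing
the negative log-likelihood:


$$-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}), \qquad (3.4.7)$$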


where for any pair of label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over q classes, the loss function l is


$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^q y_j \log \hat{y}_j. \qquad (3.4.8)$$


For reasons explained later on, the loss function in (3.4.8) is commonly called the cross-entropy loss.
Since y is a one-hot vector of length q, the sum over all its coordinates j vanishes for all but one
term. Since all $\hat{y}_j$ are predicted probabilities, their logarithm is never larger than 0. Consequently,
the loss function cannot be minimized any further if we correctly predict the actual label with 
certainty, i.e., if the predicted probability P(y | x) = 1 for the actual label y. Note that this is 
often impossible. For example, there might be label noise in the dataset (some examples may be 
mislabeled). It may also not be possible when the input features are not sufficiently informative 
to classify every example perfectly. 
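
As a small worked example of (3.4.8), the sketch below evaluates the loss for a one-hot label in plain Python. The helper `cross_entropy` and the numbers are illustrative assumptions; terms with $y_j = 0$ are skipped, which mirrors the observation that all but one term of the sum vanish.

```python
import math

def cross_entropy(y, y_hat):
    """l(y, y_hat) = -sum_j y_j * log(y_hat_j); terms with y_j == 0 are skipped."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j > 0)

y = [0, 1, 0]                 # one-hot label: the second class is correct
y_hat = [0.1, 0.8, 0.1]       # predicted probabilities

loss = cross_entropy(y, y_hat)                    # equals -log(0.8)
loss_perfect = cross_entropy(y, [0.0, 1.0, 0.0])  # certain and correct

print(loss)          # about 0.223
print(loss_perfect)  # zero: the loss cannot go any lower
```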


Softmax and Derivatives


Since the softmax and the corresponding loss are so common, it is worth understanding a bit
better how they are computed. Plugging (3.4.3) into the definition of the loss in (3.4.8) and using the
definition of the softmax we obtain:


$$\begin{aligned}
l(\mathbf{y}, \hat{\mathbf{y}}) &= -\sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\
&= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j \\
&= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j.
\end{aligned} \qquad (3.4.9)$$


To understand a bit better what is going on, consider the derivative with respect to any logit $o_j$.
We get


$$\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j. \qquad (3.4.10)$$


In other words, the derivative is the difference between the probability assigned by our model,
as expressed by the softmax operation, and what actually happened, as expressed by elements in
the one-hot label vector. In this sense, it is very similar to what we saw in regression, where the
gradient was the difference between the observation y and estimate $\hat{y}$. This is not a coincidence.
In any exponential family model (see the online appendix on distributions), the gradients of
the log-likelihood are given by precisely this term. This fact makes computing gradients easy in
practice. 
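
One way to convince yourself of (3.4.10) is a numerical check. The sketch below, in plain Python with made-up logits, compares the analytic gradient softmax(o) - y with a central finite-difference estimate of the loss written as in the last line of (3.4.9). The helper names and the step size `eps` are assumptions chosen for this illustration.

```python
import math

def softmax(o):
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def loss(o, y):
    """Cross-entropy in terms of the logits: log-sum-exp minus sum_j y_j * o_j."""
    log_sum_exp = math.log(sum(math.exp(v) for v in o))
    return log_sum_exp - sum(y_j * o_j for y_j, o_j in zip(y, o))

o = [0.5, -1.0, 2.0]    # made-up logits
y = [0, 0, 1]           # one-hot label

# Analytic gradient from (3.4.10)
analytic = [p - y_j for p, y_j in zip(softmax(o), y)]

# Central finite differences, one logit at a time
eps = 1e-6
numeric = []
for j in range(len(o)):
    o_plus = list(o)
    o_minus = list(o)
    o_plus[j] += eps
    o_minus[j] -= eps
    numeric.append((loss(o_plus, y) - loss(o_minus, y)) / (2 * eps))

max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print(analytic)
print(numeric)
print(max_gap)   # tiny: the two gradients agree
```

Note that the gradient entries sum to zero, since both the softmax output and the one-hot label sum to 1.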


Cross-Entropy Loss 


Now consider the case where we observe not just a single outcome but an entire distribution over
outcomes. We can use the same representation as before for the label y. The only difference is that
rather than a vector containing only binary entries, say (0, 0, 1), we now have a generic probability
vector, say (0.1, 0.2, 0.7). The math that we used previously to define the loss l in (3.4.8) still works
out fine, just that the interpretation is slightly more general. It is the expected value of the loss for a
distribution over labels. This loss is called the cross-entropy loss and it is one of the most commonly
used losses for classification problems. We can demystify the name by introducing just the basics
of information theory. If you wish to understand more details of information theory, you may
further refer to the online appendix on information theory.





Online appendix on distributions: https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/distributions.html
Online appendix on information theory: https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/information-theory.html


3.4.7 Information Theory Basics


Information theory deals with the problem of encoding, decoding, transmitting, and manipulating 
information (also known as data) in as concise a form as possible.


Entropy 


The central idea in information theory is to quantify the information content in data. This quantity
places a hard limit on our ability to compress the data. In information theory, this quantity is
called the entropy of a distribution P, and it is captured by the following equation:


$$H[P] = \sum_j -P(j) \log P(j). \qquad (3.4.11)$$


One of the fundamental theorems of information theory states that in order to encode data drawn
randomly from the distribution P, we need at least H[P] “nats” to encode it. If you wonder what
a “nat” is, it is the equivalent of a bit but when using a code with base e rather than one with base 2.
Thus, one nat is $\frac{1}{\log(2)} \approx 1.44$ bit.


Surprisal 


You might be wondering what compression has to do with prediction. Imagine that we have a 
stream of data that we want to compress. If it is always easy for us to predict the next token, then 
this data is easy to compress! Take the extreme example where every token in the stream always 
takes the same value. That is a very boring data stream! And not only is it boring, but it is also
easy to predict. Because they are always the same, we do not have to transmit any information to 
communicate the contents of the stream. Easy to predict, easy to compress. 


However, if we cannot perfectly predict every event, then we might sometimes be surprised. Our
surprise is greater when we assigned an event lower probability. Claude Shannon settled on
$\log \frac{1}{P(j)} = -\log P(j)$ to quantify one's surprisal at observing an event j having assigned it a (subjective)
probability P(j). The entropy defined in (3.4.11) is then the expected surprisal when one
assigned the correct probabilities that truly match the data-generating process. 
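
The relation between surprisal, entropy, nats, and bits can be sketched in a few lines of plain Python. The two distributions below (a fair coin and a constant stream) are illustrative choices, and `surprisal` and `entropy` are helper names introduced only for this example.

```python
import math

def surprisal(p):
    """Surprisal -log P(j) in nats of an event assigned probability p."""
    return -math.log(p)

def entropy(P):
    """Expected surprisal under P, as in (3.4.11); zero-probability terms are skipped."""
    return sum(p * surprisal(p) for p in P if p > 0)

fair_coin = [0.5, 0.5]
constant = [1.0, 0.0]          # every token takes the same value

h_nats = entropy(fair_coin)    # log(2) nats
h_bits = h_nats / math.log(2)  # converting nats to bits gives exactly one bit

print(h_nats, h_bits)
print(entropy(constant))       # zero: nothing needs to be transmitted
print(1 / math.log(2))         # about 1.44 bits per nat
```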


Cross-Entropy Revisited 


So if entropy is the level of surprise experienced by someone who knows the true probability, then you
might be wondering, what is cross-entropy? The cross-entropy from P to Q, denoted H(P, Q), is
the expected surprisal of an observer with subjective probabilities Q upon seeing data that were 
actually generated according to probabilities P. The lowest possible cross-entropy is achieved 
when P = Q. In this case, the cross-entropy from P to Q is H(P, P) = H(P). 
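
Written out, the expected surprisal described above is $H(P, Q) = \sum_j -P(j) \log Q(j)$. The following plain-Python sketch evaluates it for a made-up distribution P and a mismatched uniform Q, and confirms that the cross-entropy exceeds the entropy H(P, P) in this case. The helper name and the numbers are assumptions for illustration.

```python
import math

def cross_entropy(P, Q):
    """H(P, Q) = sum_j -P(j) * log Q(j), in nats."""
    return sum(-p * math.log(q) for p, q in zip(P, Q) if p > 0)

P = [0.1, 0.2, 0.7]          # data-generating distribution (made up)
Q = [1 / 3, 1 / 3, 1 / 3]    # a mismatched subjective distribution

h_pp = cross_entropy(P, P)   # equals the entropy H(P)
h_pq = cross_entropy(P, Q)   # equals log(3) for a uniform Q over 3 outcomes

print(h_pp, h_pq)            # the second value is larger
```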


In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing 
the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus the number of 
bits) required to communicate the labels.


3.4.8 Model Prediction and Evaluation


After training the softmax regression model, given any example features, we can predict the prob- 
ability of each output class. Normally, we use the class with the highest predicted probability as 
the output class. The prediction is correct if it is consistent with the actual class (label). In the next 
part of the experiment, we will use accuracy to evaluate the model’s performance. This is equal to 
the ratio between the number of correct predictions and the total number of predictions. 


Summary 


* The softmax operation takes a vector and maps it into probabilities.


* Softmax regression applies to classification problems. It uses the probability distribution of
the output class in the softmax operation.


* Cross-entropy is a good measure of the difference between two probability distributions. It
measures the number of bits needed to encode the data given our model.


Exercises


1. We can explore the connection between exponential families and the softmax in some more
depth.
    1. Compute the second derivative of the cross-entropy loss $l(\mathbf{y}, \hat{\mathbf{y}})$ for the softmax.
    2. Compute the variance of the distribution given by softmax(o) and show that it matches
    the second derivative computed above.
2. Assume that we have three classes which occur with equal probability, i.e., the probability
vector is $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.
    1. What is the problem if we try to design a binary code for it?
    2. Can you design a better code? Hint: what happens if we try to encode two independent
    observations? What if we encode n observations jointly?
3. Softmax is a misnomer for the mapping introduced above (but everyone in deep learning
uses it). The real softmax is defined as RealSoftMax(a, b) = log(exp(a) + exp(b)).
    1. Prove that RealSoftMax(a, b) > max(a, b).
    2. Prove that this holds for $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b)$, provided that $\lambda > 0$.
    3. Show that for $\lambda \to \infty$ we have $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b) \to \max(a, b)$.
    4. What does the soft-min look like?
    5. Extend this to more than two numbers.


Discussions: https://discuss.d2l.ai/t/46


3.5 The Image Classification Dataset


One of the most widely used datasets for image classification is the MNIST dataset (LeCun et al., 1998).
While it had a good run as a benchmark dataset, even simple models by today’s standards achieve
classification accuracy over 95%, making it unsuitable for distinguishing between stronger models
and weaker ones. Today, MNIST serves more as a sanity check than as a benchmark. To up the
ante just a bit, we will focus our discussion in the coming sections on the qualitatively similar, but
comparatively complex Fashion-MNIST dataset (Xiao et al., 2017), which was released in 2017.


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon
import sys

d2l.use_svg_display()


3.5.1 Reading the Dataset 


We can download and read the Fashion-MNIST dataset into memory via the built-in functions in
the framework. 


mnist_train = gluon.data.vision.FashionMNIST(train=True) 
mnist_test = gluon.data.vision.FashionMNIST(train=False) 


Fashion-MNIST consists of images from 10 categories, each represented by 6000 images in the 
training dataset and by 1000 in the test dataset. A test dataset (or test set) is used for evaluating 
model performance and not for training. Consequently the training set and the test set contain 
60000 and 10000 images, respectively. 


len(mnist_train), len(mnist_test)


(60000, 10000) 


The height and width of each input image are both 28 pixels. Note that the dataset consists of 
grayscale images, whose number of channels is 1. For brevity, throughout this book we store the 
shape of any image with height h and width w pixels as h × w or (h, w).


mnist_train[0][0].shape 


(28, 28, 1) 


The images in Fashion-MNIST are associated with the following categories: t-shirt, trousers, 
pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. The following function converts 
between numeric label indices and their names in text. 


def get_fashion_mnist_labels(labels):  #@save
    """Return text labels for the Fashion-MNIST dataset."""
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]


We can now create a function to visualize these examples. 


def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
    """Plot a list of images."""
    figsize = (num_cols * scale, num_rows * scale)
    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
    axes = axes.flatten()
    for i, (ax, img) in enumerate(zip(axes, imgs)):
        ax.imshow(img.asnumpy())
        ax.axes.get_xaxis().set_visible(False)
        ax.axes.get_yaxis().set_visible(False)
        if titles:
            ax.set_title(titles[i])
    return axes


Here are the images and their corresponding labels (in text) for the first few examples in the train- 
ing dataset. 


X, y = mnist_train[:18]

print(X.shape)
show_images(X.squeeze(axis=-1), 2, 9, titles=get_fashion_mnist_labels(y));


(18, 28, 28, 1)


[Figure: the first 18 training images in a 2 × 9 grid, each titled with its text label (for example pullover, ankle boot, shirt, t-shirt, dress, coat, sandal).]


3.5.2 Reading a Minibatch 






To make our life easier when reading from the training and test sets, we use the built-in data 
iterator rather than creating one from scratch. Recall that at each iteration, a data loader reads a 
minibatch of data with size batch_size each time. We also randomly shuffle the examples for the 
training data iterator. 


batch_size = 256

def get_dataloader_workers():  #@save
    """Use 4 processes to read the data except for Windows."""
    return 0 if sys.platform.startswith('win') else 4

# `ToTensor` converts the image data from uint8 to 32-bit floating point. It
# divides all numbers by 255 so that all pixel values are between 0 and 1
transformer = gluon.data.vision.transforms.ToTensor()
train_iter = gluon.data.DataLoader(mnist_train.transform_first(transformer),
                                   batch_size, shuffle=True,
                                   num_workers=get_dataloader_workers())


Let us look at the time it takes to read the training data. 


timer = d2l.Timer()
for X, y in train_iter:
    continue
f'{timer.stop():.2f} sec'


'1.96 sec'


3.5.3 Putting All Things Together 


Now we define the load_data_fashion_mnist function that obtains and reads the Fashion-MNIST 
dataset. It returns the data iterators for both the training set and validation set. In addition, it 
accepts an optional argument to resize images to another shape. 


def load_data_fashion_mnist(batch_size, resize=None):  #@save
    """Download the Fashion-MNIST dataset and then load it into memory."""
    dataset = gluon.data.vision
    trans = [dataset.transforms.ToTensor()]
    if resize:
        trans.insert(0, dataset.transforms.Resize(resize))
    trans = dataset.transforms.Compose(trans)
    mnist_train = dataset.FashionMNIST(train=True).transform_first(trans)
    mnist_test = dataset.FashionMNIST(train=False).transform_first(trans)
    return (gluon.data.DataLoader(mnist_train, batch_size, shuffle=True,
                                  num_workers=get_dataloader_workers()),
            gluon.data.DataLoader(mnist_test, batch_size, shuffle=False,
                                  num_workers=get_dataloader_workers()))


Below we test the image resizing feature of the load_data_fashion_mnist function by specifying 
the resize argument. 


train_iter, test_iter = load_data_fashion_mnist(32, resize=64)
for X, y in train_iter:
    print(X.shape, X.dtype, y.shape, y.dtype)
    break


(32, 1, 64, 64) <class 'numpy.float32'> (32,) <class 'numpy.int32'> 


We are now ready to work with the Fashion-MNIST dataset in the sections that follow.


Summary


* Fashion-MNIST is an apparel classification dataset consisting of images representing 10 categories.
We will use this dataset in subsequent sections and chapters to evaluate various
classification algorithms.


* We store the shape of any image with height h and width w pixels as h × w or (h, w).


* Data iterators are a key component for efficient performance. Rely on well-implemented
data iterators that exploit high-performance computing to avoid slowing down your training
loop.


Exercises 


1. Does reducing the batch_size (for instance, to 1) affect the reading performance? 


2. The data iterator performance is important. Do you think the current implementation is fast 
enough? Explore various options to improve it. 


3. Check out the framework’s online API documentation. Which other datasets are available? 


Discussions: https://discuss.d2l.ai/t/48


3.6 Implementation of Softmax Regression from Scratch


Just as we implemented linear regression from scratch, we believe that softmax regression is similarly
fundamental and you ought to know the gory details of how to implement it yourself. We
will work with the Fashion-MNIST dataset, just introduced in Section 3.5, setting up a data iterator
with batch size 256.


from d2l import mxnet as d2l
from mxnet import autograd, np, npx, gluon
from IPython import display
npx.set_np()

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)


3.6.1 Initializing Model Parameters


As in our linear regression example, each example here will be represented by a fixed-length vector.
Each example in the raw dataset is a 28 × 28 image. In this section, we will flatten each image,
treating it as a vector of length 784. In the future, we will talk about more sophisticated strategies
for exploiting the spatial structure in images, but for now we treat each pixel location as just
another feature.


Recall that in softmax regression, we have as many outputs as there are classes. Because our
dataset has 10 classes, our network will have an output dimension of 10. Consequently, our weights
will constitute a 784 × 10 matrix and the biases will constitute a 1 × 10 row vector. As with linear
regression, we will initialize our weights W with Gaussian noise and our biases to take the initial
value 0.


num_inputs = 784 
num_outputs = 10 


W = np.random.normal(0, 0.01, (num_inputs, num_outputs)) 
b = np.zeros(num_outputs) 

W.attach_grad() 

b.attach_grad() 


3.6.2 Defining the Softmax Operation 


Before implementing the softmax regression model, let us briefly review how the sum operator 
works along specific dimensions in a tensor, as discussed in Section 2.3.6. Given
a matrix X we can sum over all elements (by default) or only over elements in the same axis, i.e., 
the same column (axis 0) or the same row (axis 1). Note that if X is a tensor with shape (2, 3) and we 
sum over the columns, the result will be a vector with shape (3,). When invoking the sum operator, 
we can specify to keep the number of axes in the original tensor, rather than collapsing out the 
dimension that we summed over. This will result in a two-dimensional tensor with shape (1, 3). 


X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
X.sum(0, keepdims=True), X.sum(1, keepdims=True)


(array([[5., 7., 9.]]),
 array([[ 6.],
        [15.]]))

We are now ready to implement the softmax operation. Recall that softmax consists of three steps:
i) we exponentiate each term (using exp); ii) we sum over each row (we have one row per example
in the batch) to get the normalization constant for each example; iii) we divide each row by its
normalization constant, ensuring that the result sums to 1. Before looking at the code, let us
recall how this looks expressed as an equation:


$$\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}. \qquad (3.6.1)$$


The denominator, or normalization constant, is also sometimes called the partition function (and
its logarithm is called the log-partition function). The origins of that name are in statistical
physics, where a related equation models the distribution over an ensemble of particles.


def softmax(X):
    X_exp = np.exp(X)
    partition = X_exp.sum(1, keepdims=True)
    return X_exp / partition  # The broadcasting mechanism is applied here


As you can see, for any random input, we turn each element into a non-negative number. More- 
over, each row sums up to 1, as is required for a probability. 





Partition function (statistical mechanics): https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics)


X = np.random.normal(0, 1, (2, 5))
X_prob = softmax(X)
X_prob, X_prob.sum(1)


(array([[0.22376052, 0.06659239, 0.06583703, 0.29964197, 0.3441681 ],
        [0.63209665, 0.03179282, 0.194987  , 0.09209415, 0.04902935]]),
 array([1.        , 0.99999994]))


Note that while this looks correct mathematically, we were a bit sloppy in our implementation 
because we failed to take precautions against numerical overflow or underflow due to large or 
very small elements of the matrix. 
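
To illustrate the overflow problem, and one common remedy that is not part of the implementation above, the sketch below subtracts each row's maximum before exponentiating. It uses standard NumPy rather than the MXNet `np` module, and the function names `softmax_naive` and `softmax_stable` as well as the inputs are assumptions introduced for this example.

```python
import numpy as np

def softmax_naive(X):
    X_exp = np.exp(X)
    return X_exp / X_exp.sum(axis=1, keepdims=True)

def softmax_stable(X):
    # Subtracting each row's maximum leaves the result unchanged
    # mathematically, but keeps every exponent at or below 0.
    X_shifted = X - X.max(axis=1, keepdims=True)
    X_exp = np.exp(X_shifted)
    return X_exp / X_exp.sum(axis=1, keepdims=True)

X_small = np.array([[1.0, 2.0, 3.0]])
X_large = np.array([[1000.0, 1001.0, 1002.0]])   # same gaps, large offset

# Silence the overflow warnings so that the nan result can be inspected
with np.errstate(over='ignore', invalid='ignore'):
    naive_large = softmax_naive(X_large)   # exp overflows to inf, inf / inf is nan

stable_large = softmax_stable(X_large)

print(naive_large)               # all nan
print(stable_large)              # valid probabilities
print(softmax_naive(X_small))    # matches the stable result above
```

The shift works because multiplying the numerator and the denominator of the softmax by the same constant exp(-c) does not change the ratio.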


3.6.3 Defining the Model 


Now that we have defined the softmax operation, we can implement the softmax regression model. 
The code below defines how the input is mapped to the output through the network. Note that we
flatten each original image in the batch into a vector using the reshape function before passing
the data through our model.


def net(X):
    return softmax(np.dot(X.reshape((-1, W.shape[0])), W) + b)


3.6.4 Defining the Loss Function 


Next, we need to implement the cross-entropy loss function, as introduced in Section 3.4. This 
may be the most common loss function in all of deep learning because, at the moment, classifi- 
cation problems far outnumber regression problems. 


Recall that cross-entropy takes the negative log-likelihood of the predicted probability assigned to 
the true label. Rather than iterating over the predictions with a Python for-loop (which tends to 
be inefficient), we can pick all elements by a single operator. Below, we create sample data y_hat 
with 2 examples of predicted probabilities over 3 classes and their corresponding labels y. With y 
we know that in the first example the first class is the correct prediction and in the second example 
the third class is the ground-truth. Using y as the indices of the probabilities in y_hat, we pick the 
probability of the first class in the first example and the probability of the third class in the second 
example. 


y = np.array([0, 2])
y_hat = np.array([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y_hat[[0, 1], y]


array([0.1, 0.5])


Now we can implement the cross-entropy loss function efficiently with just one line of code. 


def cross_entropy(y_hat, y):
    return -np.log(y_hat[range(len(y_hat)), y])

cross_entropy(y_hat, y)


array([2.3025851, 0.6931472])


3.6.5 Classification Accuracy 


Given the predicted probability distribution y_hat, we typically choose the class with the highest 
predicted probability whenever we must output a hard prediction. Indeed, many applications 
require that we make a choice. Gmail must categorize an email into “Primary”, “Social”, “Updates”, 
or “Forums”. It might estimate probabilities internally, but at the end of the day it has to choose
one among the classes.


When predictions are consistent with the label class y, they are correct. The classification ac- 
curacy is the fraction of all predictions that are correct. Although it can be difficult to optimize 
accuracy directly (it is not differentiable), it is often the performance measure that we care most 
about, and we will nearly always report it when training classifiers. 


To compute accuracy we do the following. First, if y_hat is a matrix, we assume that the second 
dimension stores prediction scores for each class. We use argmax to obtain the predicted class by 
the index for the largest entry in each row. Then we compare the predicted class with the ground- 
truth y elementwise. Since the equality operator == is sensitive to data types, we convert y_hat's 
data type to match that of y. The result is a tensor containing entries of 0 (false) and 1 (true). 
Taking the sum yields the number of correct predictions. 


def accuracy(y_hat, y):  #@save
    """Compute the number of correct predictions."""
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(axis=1)
    cmp = y_hat.astype(y.dtype) == y
    return float(cmp.astype(y.dtype).sum())


We will continue to use the variables y_hat and y defined before as the predicted probability dis- 
tributions and labels, respectively. We can see that the first example’s prediction class is 2 (the 
largest element of the row is 0.6 with the index 2), which is inconsistent with the actual label, 0. 
The second example’s prediction class is 2 (the largest element of the row is 0.5 with the index of 
2), which is consistent with the actual label, 2. Therefore, the classification accuracy rate for these 
two examples is 0.5. 


accuracy(y_hat, y) / len(y) 


0.5 


Similarly, we can evaluate the accuracy for any model net on a dataset that is accessed via the data 
iterator data_iter. 


def evaluate_accuracy(net, data_iter):  #@save
    """Compute the accuracy for a model on a dataset."""
    metric = Accumulator(2)  # No. of correct predictions, no. of predictions
    for X, y in data_iter:
        metric.add(accuracy(net(X), y), y.size)
    return metric[0] / metric[1]


Here Accumulator is a utility class to accumulate sums over multiple variables. In the above
evaluate_accuracy function, we create 2 variables in the Accumulator instance for storing both the
number of correct predictions and the number of predictions, respectively. Both will be accumulated
over time as we iterate over the dataset.


class Accumulator:  #@save
    """For accumulating sums over `n` variables."""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0.0] * len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


Because we initialized the net model with random weights, the accuracy of this model should be 
close to random guessing, i.e., 0.1 for 10 classes. 


evaluate_accuracy(net, test_iter) 


0.0811 


3.6.6 Training 


The training loop for softmax regression should look strikingly familiar if you read through our 
implementation of linear regression in Section 3.2. Here we refactor the implementation to make 
it reusable. First, we define a function to train for one epoch. Note that updater is a general 
function to update the model parameters, which accepts the batch size as an argument. It can be 
either a wrapper of the d2l.sgd function or a framework’s built-in optimization function.


def train_epoch_ch3(net, train_iter, loss, updater):  #@save
    """Train a model within one epoch (defined in Chapter 3)."""
    # Sum of training loss, sum of training accuracy, no. of examples
    metric = Accumulator(3)
    if isinstance(updater, gluon.Trainer):
        updater = updater.step
    for X, y in train_iter:
        # Compute gradients and update parameters
        with autograd.record():
            y_hat = net(X)
            l = loss(y_hat, y)
        l.backward()
        updater(X.shape[0])
        metric.add(float(l.sum()), accuracy(y_hat, y), y.size)
    # Return training loss and training accuracy
    return metric[0] / metric[2], metric[1] / metric[2]


Before showing the implementation of the training function, we define a utility class that plots data
in animation. Again, it aims to simplify code in the rest of the book.







class Animator:  #@save
    """For plotting data in animation."""
    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
                 ylim=None, xscale='linear', yscale='linear',
                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
                 figsize=(3.5, 2.5)):
        # Incrementally plot multiple lines
        if legend is None:
            legend = []
        d2l.use_svg_display()
        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
        if nrows * ncols == 1:
            self.axes = [self.axes, ]
        # Use a lambda function to capture arguments
        self.config_axes = lambda: d2l.set_axes(
            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
        self.X, self.Y, self.fmts = None, None, fmts

    def add(self, x, y):
        # Add multiple data points into the figure
        if not hasattr(y, "__len__"):
            y = [y]
        n = len(y)
        if not hasattr(x, "__len__"):
            x = [x] * n
        if not self.X:
            self.X = [[] for _ in range(n)]
        if not self.Y:
            self.Y = [[] for _ in range(n)]
        for i, (a, b) in enumerate(zip(x, y)):
            if a is not None and b is not None:
                self.X[i].append(a)
                self.Y[i].append(b)
        self.axes[0].cla()
        for x, y, fmt in zip(self.X, self.Y, self.fmts):
            self.axes[0].plot(x, y, fmt)
        self.config_axes()
        display.display(self.fig)
        display.clear_output(wait=True)


The following training function then trains a model net on a training dataset accessed via 
train_iter for multiple epochs, which is specified by num_epochs. At the end of each epoch, the 
model is evaluated on a testing dataset accessed via test_iter. We will leverage the Animator class 
to visualize the training progress. 


def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):  #@save
    """Train a model (defined in Chapter 3)."""
    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
                        legend=['train loss', 'train acc', 'test acc'])
    for epoch in range(num_epochs):
        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
        test_acc = evaluate_accuracy(net, test_iter)
        animator.add(epoch + 1, train_metrics + (test_acc,))
    train_loss, train_acc = train_metrics
    assert train_loss < 0.5, train_loss
    assert train_acc <= 1 and train_acc > 0.7, train_acc
    assert test_acc <= 1 and test_acc > 0.7, test_acc


As an implementation from scratch, we use the minibatch stochastic gradient descent defined in 
Section 3.2 to optimize the loss function of the model with a learning rate 0.1. 


lr = 0.1

def updater(batch_size):
    return d2l.sgd([W, b], lr, batch_size)


Now we train the model with 10 epochs. Note that both the number of epochs (num_epochs) and the
learning rate (lr) are adjustable hyperparameters. By changing their values, we may be able to
increase the classification accuracy of the model. 


num_epochs = 10 
train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, updater) 


[Figure: train loss, train acc, and test acc plotted against epoch.]





3.6.7 Prediction 


Now that training is complete, our model is ready to classify some images. Given a series of im- 
ages, we will compare their actual labels (first line of text output) and the predictions from the 
model (second line of text output). 


def predict_ch3(net, test_iter, n=6):  #@save
    """Predict labels (defined in Chapter 3)."""
    for X, y in test_iter:
        break
    trues = d2l.get_fashion_mnist_labels(y)
    preds = d2l.get_fashion_mnist_labels(net(X).argmax(axis=1))
    titles = [true + '\n' + pred for true, pred in zip(trues, preds)]
    d2l.show_images(X[0:n].reshape((n, 28, 28)), 1, n, titles=titles[0:n])


predict_ch3(net, test_iter)

t-shirt trouser pullover pullover dress pullover
t-shirt trouser pullover shirt coat shirt





Summary 


• With softmax regression, we can train models for multiclass classification.

• The training loop of softmax regression is very similar to that in linear regression: retrieve
and read data, define models and loss functions, then train models using optimization
algorithms. As you will soon find out, most common deep learning models have similar training
procedures.


Exercises 


1. In this section, we directly implemented the softmax function based on the mathematical
definition of the softmax operation. What problems might this cause? Hint: try to calculate
the size of exp(50).

2. The function cross_entropy in this section was implemented according to the definition of
the cross-entropy loss function. What could be the problem with this implementation? Hint:
consider the domain of the logarithm.

3. What solutions can you think of to fix the two problems above?

4. Is it always a good idea to return the most likely label? For example, would you do this for
medical diagnosis?

5. Assume that we want to use softmax regression to predict the next word based on some
features. What are some problems that might arise from a large vocabulary?


Discussions: https://discuss.d2l.ai/t/50


3.7 Concise Implementation of Softmax Regression


Just as high-level APIs of deep learning frameworks made it much easier to implement linear
regression in Section 3.3, we will find it similarly (or possibly more) convenient for implementing
classification models. Let us stick with the Fashion-MNIST dataset and keep the batch size at 256
as in Section 3.6.


from d2l import mxnet as d2l
from mxnet import gluon, init, npx
from mxnet.gluon import nn
npx.set_np()

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)


3.7.1 Initializing Model Parameters 


As mentioned in Section 3.4, the output layer of softmax regression is a fully-connected layer. 
Therefore, to implement our model, we just need to add one fully-connected layer with 10 outputs 
to our Sequential. Again, here, the Sequential is not really necessary, but we might as well form 
the habit since it will be ubiquitous when implementing deep models. Again, we initialize the 
weights at random with zero mean and standard deviation 0.01. 


net = nn.Sequential()
net.add(nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))


3.7.2 Softmax Implementation Revisited 


In the previous example of Section 3.6, we calculated our model's output and then ran this output 
through the cross-entropy loss. Mathematically, that is a perfectly reasonable thing to do. How- 
ever, from a computational perspective, exponentiation can be a source of numerical stability 
issues. 

Recall that the softmax function calculates $\hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$, where
$\hat y_j$ is the $j^\mathrm{th}$ element of the predicted probability distribution $\hat{\mathbf{y}}$
and $o_j$ is the $j^\mathrm{th}$ element of the logits $\mathbf{o}$. If some of the $o_k$
are very large (i.e., very positive), then $\exp(o_k)$ might be larger than the largest number we can
have for certain data types (i.e., overflow). This would make the denominator (and/or numerator)
inf (infinity) and we wind up encountering either 0, inf, or nan (not a number) for $\hat y_j$. In these
situations we do not get a well-defined return value for cross-entropy.


One trick to get around this is to first subtract $\max(o_k)$ from all $o_k$ before proceeding with the
softmax calculation. You can verify that shifting each $o_k$ by the same constant does not change
the return value of softmax. After the subtraction and normalization step, it might be possible that
some of the shifted $o_j$ have large negative values and thus that the corresponding $\exp(o_j)$ will
take values close to zero. These might be rounded to zero due to finite precision (i.e., underflow),
making $\hat y_j$ zero and giving us -inf for $\log(\hat y_j)$. A few steps down the road in
backpropagation, we might find ourselves faced with a screenful of the dreaded nan results.


Fortunately, we are saved by the fact that even though we are computing exponential functions, 
we ultimately intend to take their log (when calculating the cross-entropy loss). By combining 
these two operators softmax and cross-entropy together, we can escape the numerical stability 
issues that might otherwise plague us during backpropagation. As shown in the equation below, 
we avoid calculating $\exp(o_j)$ and can use instead $o_j$ directly due to the canceling in $\log(\exp(\cdot))$.


$$
\begin{aligned}
\log(\hat y_j) &= \log\left(\frac{\exp(o_j)}{\sum_k \exp(o_k)}\right) \\
&= \log(\exp(o_j)) - \log\left(\sum_k \exp(o_k)\right) \\
&= o_j - \log\left(\sum_k \exp(o_k)\right).
\end{aligned}
\qquad (3.7.1)
$$

We will want to keep the conventional softmax function handy in case we ever want to evaluate
the output probabilities by our model. But instead of passing softmax probabilities into our new
loss function, we will just pass the logits and compute the softmax and its log all at once inside the
cross-entropy loss function, which does smart things like the "LogSumExp trick".


loss = gluon.loss.SoftmaxCrossEntropyLoss() 
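To make the idea concrete, here is a minimal NumPy sketch of computing the cross-entropy directly from logits with the max-subtraction (LogSumExp) trick. It is meant only as an illustration under our own assumptions: the function names `log_softmax` and `cross_entropy_from_logits` and the toy logits are hypothetical, and we make no claim that the framework's loss is implemented this way internally.

```python
import numpy as np

def log_softmax(o):
    """Row-wise log-softmax using the max-subtraction (LogSumExp) trick."""
    shifted = o - o.max(axis=1, keepdims=True)   # largest entry per row becomes 0
    lse = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return shifted - lse                         # o_j - log(sum_k exp(o_k))

def cross_entropy_from_logits(o, y):
    """Average negative log-likelihood of integer labels y given logits o."""
    return -log_softmax(o)[np.arange(len(y)), y].mean()

# Toy logits: the first row would overflow a naive softmax.
logits = np.array([[1000.0, 0.0, -1000.0],
                   [1.0, 2.0, 3.0]])
labels = np.array([0, 2])

# Naive softmax: exp(1000) overflows to inf, and inf / inf gives nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(naive[0])

# The combined computation stays finite.
loss_value = cross_entropy_from_logits(logits, labels)
print(loss_value)
```

Because the logarithm is never applied to a probability that has underflowed to zero, the log-probabilities remain finite even for extreme logits.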


3.7.3 Optimization Algorithm 


Here, we use minibatch stochastic gradient descent with a learning rate of 0.1 as the optimiza- 
tion algorithm. Note that this is the same as we applied in the linear regression example and it 
illustrates the general applicability of the optimizers. 


trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})


3.7.4 Training 


Next we call the training function defined in Section 3.6 to train the model. 


num_epochs = 10 
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)


[Figure: training curves plotted against epoch; legend: train loss, train acc, test acc]


As before, this algorithm converges to a solution that achieves a decent accuracy, albeit this time 
with fewer lines of code than before. 





LogSumExp trick: https://en.wikipedia.org/wiki/LogSumExp


Summary


• Using high-level APIs, we can implement softmax regression much more concisely.

• From a computational perspective, implementing softmax regression has intricacies. Note
that in many cases, a deep learning framework takes additional precautions beyond these
most well-known tricks to ensure numerical stability, saving us from even more pitfalls that
we would encounter if we tried to code all of our models from scratch in practice.


Exercises 
1. Try adjusting the hyperparameters, such as the batch size, number of epochs, and learning 
rate, to see what the results are. 


2. Increase the number of epochs for training. Why might the test accuracy decrease after a
while? How could we fix this? 


Discussions: https://discuss.d2l.ai/t/52


4 Multilayer Perceptrons


In this chapter, we will introduce your first truly deep network. The simplest deep networks are 
called multilayer perceptrons, and they consist of multiple layers of neurons each fully connected 
to those in the layer below (from which they receive input) and those above (which they, in turn, 
influence). When we train high-capacity models we run the risk of overfitting. Thus, we will 
need to provide your first rigorous introduction to the notions of overfitting, underfitting, and 
model selection. To help you combat these problems, we will introduce regularization techniques 
such as weight decay and dropout. We will also discuss issues relating to numerical stability and 
parameter initialization that are key to successfully training deep networks. Throughout, we aim 
to give you a firm grasp not just of the concepts but also of the practice of using deep networks. 
At the end of this chapter, we apply what we have introduced so far to a real case: house price 
prediction. We punt matters relating to the computational performance, scalability, and efficiency 
of our models to subsequent chapters. 


4.1 Multilayer Perceptrons 


In Chapter 3, we introduced softmax regression (Section 3.4), implementing the algorithm from 
scratch (Section 3.6) and using high-level APIs (Section 3.7), and training classifiers to recognize 10 
categories of clothing from low-resolution images. Along the way, we learned how to wrangle data, 
coerce our outputs into a valid probability distribution, apply an appropriate loss function, and 
minimize it with respect to our model’s parameters. Now that we have mastered these mechanics 
in the context of simple linear models, we can launch our exploration of deep neural networks, 
the comparatively rich class of models with which this book is primarily concerned. 


4.1.1 Hidden Layers 


We have described the affine transformation in Section 3.1.1, which is a linear transformation
plus a bias. To begin, recall the model architecture corresponding to our softmax regression
example, illustrated in Fig. 3.4.1. This model mapped our inputs directly to our outputs via a 
single affine transformation, followed by a softmax operation. If our labels truly were related to 
our input data by an affine transformation, then this approach would be sufficient. But linearity 
in affine transformations is a strong assumption.


Linear Models May Go Wrong


For example, linearity implies the weaker assumption of monotonicity: that any increase in our 
feature must either always cause an increase in our model’s output (if the corresponding weight 
is positive), or always cause a decrease in our model’s output (if the corresponding weight is neg- 
ative). Sometimes that makes sense. For example, if we were trying to predict whether an indi- 
vidual will repay a loan, we might reasonably imagine that holding all else equal, an applicant 
with a higher income would always be more likely to repay than one with a lower income. While 
monotonic, this relationship likely is not linearly associated with the probability of repayment. 
An increase in income from 0 to 50 thousand likely corresponds to a bigger increase in likelihood 
of repayment than an increase from 1 million to 1.05 million. One way to handle this might be 
to preprocess our data such that linearity becomes more plausible, say, by using the logarithm of 
income as our feature. 


Note that we can easily come up with examples that violate monotonicity. Say for example that we
want to predict probability of death based on body temperature. For individuals with a body
temperature above 37°C (98.6°F), higher temperatures indicate greater risk. However, for individuals
with body temperatures below 37°C, higher temperatures indicate lower risk! In this case too, we
might resolve the problem with some clever preprocessing. Namely, we might use the distance
from 37°C as our feature.


But what about classifying images of cats and dogs? Should increasing the intensity of the pixel 
at location (13, 17) always increase (or always decrease) the likelihood that the image depicts a 
dog? Reliance on a linear model corresponds to the implicit assumption that the only requirement 
for differentiating cats vs. dogs is to assess the brightness of individual pixels. This approach is 
doomed to fail in a world where inverting an image preserves the category. 


And yet despite the apparent absurdity of linearity here, as compared with our previous exam- 
ples, it is less obvious that we could address the problem with a simple preprocessing fix. That 
is because the significance of any pixel depends in complex ways on its context (the values of the 
surrounding pixels). While there might exist a representation of our data that would take into 
account the relevant interactions among our features, on top of which a linear model would be 
suitable, we simply do not know how to calculate it by hand. With deep neural networks, we use
observational data to jointly learn both a representation via hidden layers and a linear predictor
that acts upon that representation.


Incorporating Hidden Layers 


We can overcome these limitations of linear models and handle a more general class of functions 
by incorporating one or more hidden layers. The easiest way to do this is to stack many fully- 
connected layers on top of each other. Each layer feeds into the layer above it, until we generate 
outputs. We can think of the first $L-1$ layers as our representation and the final layer as our linear
predictor. This architecture is commonly called a multilayer perceptron, often abbreviated as MLP.
Below, we depict an MLP diagrammatically (Fig. 4.1.1).

Fig. 4.1.1: An MLP with a hidden layer of 5 hidden units.


This MLP has 4 inputs, 3 outputs, and its hidden layer contains 5 hidden units. Since the input 
layer does not involve any calculations, producing outputs with this network requires implement- 
ing the computations for both the hidden and output layers; thus, the number of layers in this 
MLP is 2. Note that these layers are both fully connected. Every input influences every neuron in 
the hidden layer, and each of these in turn influences every neuron in the output layer. However, 
as suggested by Section 3.4.3, the parameterization cost of MLPs with fully-connected layers can
be prohibitively high, which may motivate a tradeoff between parameter saving and model
effectiveness even without changing the input or output size (Zhang et al., 2021).
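To get a feel for how quickly the parameterization cost grows, we can count the weights and biases of a fully-connected network from its layer widths. The helper name `mlp_num_params` is our own (hypothetical), and the second call uses a 784-256-10 network purely as an illustrative size for 28 × 28 images with 10 classes.

```python
def mlp_num_params(layer_sizes):
    """Count weights and biases of a fully-connected network.

    layer_sizes lists the widths from input to output, e.g. [4, 5, 3].
    Each layer contributes n_in * n_out weights and n_out biases.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

small = mlp_num_params([4, 5, 3])        # 4*5 + 5 + 5*3 + 3 = 43
large = mlp_num_params([784, 256, 10])   # 784*256 + 256 + 256*10 + 10 = 203530
print(small, large)
```

Even a single hidden layer of modest width already dominates the parameter count once the input dimension is in the hundreds.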


From Linear to Nonlinear 


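As before, by the matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, we denote a minibatch of $n$ examples where each example has
$d$ inputs (features). For a one-hidden-layer MLP whose hidden layer has $h$ hidden units, denote
by $\mathbf{H} \in \mathbb{R}^{n \times h}$ the outputs of the hidden layer, which are hidden representations. In mathematics or
code, $\mathbf{H}$ is also known as a hidden-layer variable or a hidden variable. Since the hidden and output
layers are both fully connected, we have hidden-layer weights $\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}$ and biases $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$
and output-layer weights $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}$ and biases $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$. Formally, we calculate the outputs
$\mathbf{O} \in \mathbb{R}^{n \times q}$ of the one-hidden-layer MLP as follows:

$$
\begin{aligned}
\mathbf{H} &= \mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}, \\
\mathbf{O} &= \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}.
\end{aligned}
\qquad (4.1.1)
$$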


Note that after adding the hidden layer, our model now requires us to track and update additional 
sets of parameters. So what have we gained in exchange? You might be surprised to find out that— 
in the model defined above—we gain nothing for our troubles! The reason is plain. The hidden units 
above are given by an affine function of the inputs, and the outputs (pre-softmax) are just an affine 
function of the hidden units. An affine function of an affine function is itself an affine function. 
Moreover, our linear model was already capable of representing any affine function. 


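We can view the equivalence formally by proving that for any values of the weights, we can just
collapse out the hidden layer, yielding an equivalent single-layer model with parameters
$\mathbf{W} = \mathbf{W}^{(1)} \mathbf{W}^{(2)}$ and $\mathbf{b} = \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}$:

$$
\mathbf{O} = (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}) \mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W} + \mathbf{b}. \qquad (4.1.2)
$$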
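The collapse can also be checked numerically. The following NumPy sketch draws random parameters (the sizes and the random seed are arbitrary choices of ours) and confirms that two stacked affine layers without an activation function produce the same outputs as the single affine layer with the collapsed parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, q = 6, 4, 5, 3                      # arbitrary sizes for the demo
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, h)), rng.normal(size=(1, h))
W2, b2 = rng.normal(size=(h, q)), rng.normal(size=(1, q))

# Two stacked affine layers without an activation function.
O_two_layers = (X @ W1 + b1) @ W2 + b2

# The equivalent single affine layer with collapsed parameters.
W = W1 @ W2
b = b1 @ W2 + b2
O_one_layer = X @ W + b

print(np.allclose(O_two_layers, O_one_layer))
```

The two outputs agree up to floating-point rounding, so the extra layer adds parameters without adding expressive power.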


In order to realize the potential of multilayer architectures, we need one more key ingredient:
a nonlinear activation function $\sigma$ to be applied to each hidden unit following the affine
transformation. The outputs of activation functions (e.g., $\sigma(\cdot)$) are called activations. In general, with
activation functions in place, it is no longer possible to collapse our MLP into a linear model:

$$
\begin{aligned}
\mathbf{H} &= \sigma(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}), \\
\mathbf{O} &= \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}.
\end{aligned}
\qquad (4.1.3)
$$

Since each row in X corresponds to an example in the minibatch, with some abuse of notation, 
we define the nonlinearity $\sigma$ to apply to its inputs in a rowwise fashion, i.e., one example at a
time. Note that we used the notation for softmax in the same way to denote a rowwise operation 
in Section 3.4.5. Often, as in this section, the activation functions that we apply to hidden layers 
are not merely rowwise, but elementwise. That means that after computing the linear portion of 
the layer, we can calculate each activation without looking at the values taken by the other hidden 
units. This is true for most activation functions. 


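To build more general MLPs, we can continue stacking such hidden layers, e.g.,
$\mathbf{H}^{(1)} = \sigma_1(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})$ and $\mathbf{H}^{(2)} = \sigma_2(\mathbf{H}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)})$, one atop another, yielding ever more expressive models.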


Universal Approximators 


MLPs can capture complex interactions among our inputs via their hidden neurons, which depend 
on the values of each of the inputs. We can easily design hidden nodes to perform arbitrary com- 
putation, for instance, basic logic operations on a pair of inputs. Moreover, for certain choices 
of the activation function, it is widely known that MLPs are universal approximators. Even with 
a single-hidden-layer network, given enough nodes (possibly absurdly many), and the right set of 
weights, we can model any function, though actually learning that function is the hard part. You 
might think of your neural network as being a bit like the C programming language. The language, 
like any other modern language, is capable of expressing any computable program. But actually 
coming up with a program that meets your specifications is the hard part. 
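As a concrete instance of designing hidden nodes by hand for a logic operation, the sketch below wires up XOR on a pair of binary inputs with two hidden units. The weights are picked by hand for illustration (they are our own choice, not learned), and the nonlinearity is max(z, 0), the ReLU function introduced in Section 4.1.2. We use plain NumPy to keep the snippet self-contained.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

# All four combinations of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked parameters: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])

# Output: h1 - 2 * h2, which equals XOR(x1, x2) on binary inputs.
W2 = np.array([[1.0], [-2.0]])
b2 = np.array([0.0])

H = relu(X @ W1 + b1)
O = H @ W2 + b2
print(O.ravel())
```

No single affine layer can reproduce this input-output table, since XOR is not linearly separable, so the example also shows what the hidden layer with its nonlinearity buys us.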


Moreover, just because a single-hidden-layer network can learn any function does not mean that 
you should try to solve all of your problems with single-hidden-layer networks. In fact, we can 
approximate many functions much more compactly by using deeper (vs. wider) networks. We 
will touch upon more rigorous arguments in subsequent chapters. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, np, npx
npx.set_np()


4.1.2 Activation Functions 


An activation function decides whether a neuron should be activated or not by transforming the
weighted sum of its inputs plus a bias. Activation functions are differentiable operators that
transform input signals to outputs, and most of them add nonlinearity. Because activation functions are
fundamental to deep learning, let us briefly survey some common activation functions.


ReLU Function


The most popular choice, due to both simplicity of implementation and its good performance on a
variety of predictive tasks, is the rectified linear unit (ReLU). ReLU provides a very simple nonlinear
transformation. Given an element $x$, the function is defined as the maximum of that element and
0:

$$\operatorname{ReLU}(x) = \max(x, 0). \qquad (4.1.4)$$


Informally, the ReLU function retains only positive elements and discards all negative elements 
by setting the corresponding activations to 0. To gain some intuition, we can plot the function. As 
you can see, the activation function is piecewise linear. 


x = np.arange(-8.0, 8.0, 0.1)
x.attach_grad()
with autograd.record():
    y = npx.relu(x)
d2l.plot(x, y, 'x', 'relu(x)', figsize=(5, 2.5))


[Figure: plot of relu(x) against x]


When the input is negative, the derivative of the ReLU function is 0, and when the input is positive, 
the derivative of the ReLU function is 1. Note that the ReLU function is not differentiable when the 
input takes value precisely equal to 0. In these cases, we default to the left-hand-side derivative 
and say that the derivative is 0 when the input is 0. We can get away with this because the input 
may never actually be zero. There is an old adage that if subtle boundary conditions matter, we 
are probably doing (real) mathematics, not engineering. That conventional wisdom may apply 
here. We plot the derivative of the ReLU function below.


y.backward()
d2l.plot(x, x.grad, 'x', 'grad of relu', figsize=(5, 2.5))


The reason for using ReLU is that its derivatives are particularly well behaved: either they vanish
or they just let the argument through. This makes optimization better behaved and mitigates
the well-documented problem of vanishing gradients that plagued previous versions of neural
networks (more on this later).


Note that there are many variants to the ReLU function, including the parameterized ReLU (pReLU) 
function (He et al., 2015). This variation adds a linear term to ReLU, so some information still gets 
through, even when the argument is negative: 


$$\operatorname{pReLU}(x) = \max(0, x) + \alpha \min(0, x). \qquad (4.1.5)$$
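The formula translates directly into code. The sketch below is a plain NumPy illustration in which the slope alpha is passed as a fixed constant; treating it as a constant is our simplifying assumption, and the function name `prelu` is hypothetical.

```python
import numpy as np

def prelu(x, alpha):
    """pReLU(x) = max(0, x) + alpha * min(0, x), applied elementwise."""
    return np.maximum(0, x) + alpha * np.minimum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
leaky = prelu(x, alpha=0.25)   # negative inputs are scaled by alpha, not zeroed
plain = prelu(x, alpha=0.0)    # alpha = 0 recovers the ordinary ReLU
print(leaky)
print(plain)
```

Positive inputs pass through unchanged in both cases; only the treatment of negative inputs depends on alpha.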


Sigmoid Function 


The sigmoid function transforms its inputs, for which values lie in the domain $\mathbb{R}$, to outputs that lie
on the interval (0, 1). For that reason, the sigmoid is often called a squashing function: it squashes
any input in the range (-inf, inf) to some value in the range (0, 1):

$$\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}. \qquad (4.1.6)$$

In the earliest neural networks, scientists were interested in modeling biological neurons which 
either fire or do not fire. Thus the pioneers of this field, going all the way back to McCulloch and 
Pitts, the inventors of the artificial neuron, focused on thresholding units. A thresholding activa- 
tion takes value 0 when its input is below some threshold and value 1 when the input exceeds the 
threshold. 


When attention shifted to gradient-based learning, the sigmoid function was a natural choice
because it is a smooth, differentiable approximation to a thresholding unit. Sigmoids are still widely
used as activation functions on the output units, when we want to interpret the outputs as
probabilities for binary classification problems (you can think of the sigmoid as a special case of the
softmax). However, the sigmoid has mostly been replaced by the simpler and more easily
trainable ReLU for most uses in hidden layers. In later chapters on recurrent neural networks, we will
describe architectures that leverage sigmoid units to control the flow of information across time.


Below, we plot the sigmoid function. Note that when the input is close to 0, the sigmoid function
approaches a linear transformation.


with autograd.record():
    y = npx.sigmoid(x)
d2l.plot(x, y, 'x', 'sigmoid(x)', figsize=(5, 2.5))


The derivative of the sigmoid function is given by the following equation:

$$\frac{d}{dx} \operatorname{sigmoid}(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \operatorname{sigmoid}(x)\left(1 - \operatorname{sigmoid}(x)\right). \qquad (4.1.7)$$





The derivative of the sigmoid function is plotted below. Note that when the input is 0, the deriva- 
tive of the sigmoid function reaches a maximum of 0.25. As the input diverges from 0 in either 
direction, the derivative approaches 0. 
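We can sanity-check both claims without automatic differentiation. The NumPy sketch below compares the closed-form derivative with a central finite difference on a grid of inputs and then locates its maximum. The grid and the step size eps are arbitrary choices of ours.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-8.0, 8.0, 161)   # grid with step 0.1 that includes x = 0
eps = 1e-5
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))

max_error = np.max(np.abs(numeric - analytic))
peak_value = analytic.max()
peak_location = x[analytic.argmax()]
print(max_error)                  # small discretization error
print(peak_value, peak_location)  # maximum of the derivative and where it occurs
```

The finite-difference estimate agrees with the formula to within discretization error, and the largest derivative on the grid occurs at the input 0.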


y.backward()
d2l.plot(x, x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))


Tanh Function


Like the sigmoid function, the tanh (hyperbolic tangent) function also squashes its inputs,
transforming them into elements on the interval between -1 and 1:

$$\tanh(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}. \qquad (4.1.8)$$

We plot the tanh function below. Note that as the input nears 0, the tanh function approaches a
linear transformation. Although the shape of the function is similar to that of the sigmoid function,
the tanh function exhibits point symmetry about the origin of the coordinate system.


with autograd.record():
    y = np.tanh(x)
d2l.plot(x, y, 'x', 'tanh(x)', figsize=(5, 2.5))


The derivative of the tanh function is:

$$\frac{d}{dx} \tanh(x) = 1 - \tanh^2(x). \qquad (4.1.9)$$


The derivative of the tanh function is plotted below. As the input nears 0, the derivative of the tanh
function approaches a maximum of 1. And as we saw with the sigmoid function, as the input
moves away from 0 in either direction, the derivative of the tanh function approaches 0.
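As a quick numerical check of this behavior, the NumPy sketch below compares the closed-form derivative of tanh with a central finite difference and inspects its values at the center and at the edges of the grid. As before, the grid and the step size eps are our own arbitrary choices.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 161)   # grid with step 0.1 that includes x = 0
eps = 1e-5
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)  # central difference
analytic = 1 - np.tanh(x) ** 2

tanh_max_error = np.max(np.abs(numeric - analytic))
print(tanh_max_error)                 # small discretization error
print(analytic.max())                 # largest derivative on the grid
print(analytic[0], analytic[-1])      # derivative at the edges of the grid
```

Compared with the sigmoid, the derivative of tanh peaks four times higher at the origin, yet it decays toward 0 just as the sigmoid's does once the input moves away from 0.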


y.backward()
d2l.plot(x, x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))


In summary, we now know how to incorporate nonlinearities to build expressive multilayer
neural network architectures. As a side note, your knowledge already puts you in command of a
similar toolkit to a practitioner circa 1990. In some ways, you have an advantage over anyone working
in the 1990s, because you can leverage powerful open-source deep learning frameworks to build
models rapidly, using only a few lines of code. Previously, training these networks required
researchers to code up thousands of lines of C and Fortran.


Summary


• An MLP adds one or multiple fully-connected hidden layers between the output and input layers
and transforms the output of the hidden layer via an activation function.

• Commonly-used activation functions include the ReLU function, the sigmoid function, and
the tanh function.


Exercises 


1. Compute the derivative of the pReLU activation function. 


2. Show that an MLP using only ReLU (or pReLU) constructs a continuous piecewise linear 
function. 


3. Show that tanh(x) + 1 = 2sigmoid(2x). 


4. Assume that we have a nonlinearity that applies to one minibatch at a time. What kinds of 
problems do you expect this to cause? 


Discussions: https://discuss.d2l.ai/t/90


4.2 Implementation of Multilayer Perceptrons from Scratch


Now that we have characterized multilayer perceptrons (MLPs) mathematically, let us try to im- 
plement one ourselves. To compare against our previous results achieved with softmax regression 
(Section 3.6), we will continue to work with the Fashion-MNIST image classification dataset (Sec- 
tion 3.5). 


from d2l import mxnet as d2l
from mxnet import gluon, np, npx
npx.set_np()

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)


4.2.1 Initializing Model Parameters 


Recall that Fashion-MNIST contains 10 classes, and that each image consists of a 28 x 28 = 784 grid 
of grayscale pixel values. Again, we will disregard the spatial structure among the pixels for now, 
so we can think of this as simply a classification dataset with 784 input features and 10 classes. To 
begin, we will implement an MLP with one hidden layer and 256 hidden units. Note that we can 
regard both of these quantities as hyperparameters. Typically, we choose layer widths in powers of 
2, which tend to be computationally efficient because of how memory is allocated and addressed 
in hardware. 


Again, we will represent our parameters with several tensors. Note that for every layer, we must 
keep track of one weight matrix and one bias vector. As always, we allocate memory for the gra- 
dients of the loss with respect to these parameters. 


num_inputs, num_outputs, num_hiddens = 784, 10, 256 


W1 = np.random.normal(scale=0.01, size=(num_inputs, num_hiddens))
b1 = np.zeros(num_hiddens)
W2 = np.random.normal(scale=0.01, size=(num_hiddens, num_outputs))
b2 = np.zeros(num_outputs)
params = [W1, b1, W2, b2]

for param in params:
    param.attach_grad()


4.2.2 Activation Function 


To make sure we know how everything works, we will implement the ReLU activation ourselves 
using the maximum function rather than invoking the built-in relu function directly. 


def relu(X):
    return np.maximum(X, 0)







4.2.3 Model 


Because we are disregarding spatial structure, we reshape each two-dimensional image into a flat 
vector of length num_inputs. Finally, we implement our model with just a few lines of code. 


def net(X):
    X = X.reshape((-1, num_inputs))
    H = relu(np.dot(X, W1) + b1)
    return np.dot(H, W2) + b2


4.2.4 Loss Function 


To ensure numerical stability, and because we already implemented the softmax function from 
scratch (Section 3.6), we leverage the integrated function from high-level APIs for calculating the 
softmax and cross-entropy loss. Recall our earlier discussion of these intricacies in Section 3.7.2. 
We encourage the interested reader to examine the source code for the loss function to deepen 
their knowledge of implementation details. 


loss = gluon.loss.SoftmaxCrossEntropyLoss() 


4.2.5 Training 


Fortunately, the training loop for MLPs is exactly the same as for softmax regression. Leveraging 
the d2l package again, we call the train_ch3 function (see Section 3.6), setting the number of
epochs to 10 and the learning rate to 0.1. 


num_epochs, lr = 10, 0.1 


d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs,
              lambda batch_size: d2l.sgd(params, lr, batch_size))


To evaluate the learned model, we apply it on some test data.


d2l.predict_ch3(net, test_iter)


Summary


• We saw that implementing a simple MLP is easy, even when done manually.

• However, with a large number of layers, implementing MLPs from scratch can still get messy
(e.g., naming and keeping track of our model's parameters).


Exercises 


1. Change the value of the hyperparameter num_hiddens and see how this hyperparameter
influences your results. Determine the best value of this hyperparameter, keeping all others
constant.

2. Try adding an additional hidden layer to see how it affects the results.

3. How does changing the learning rate alter your results? Fixing the model architecture and
other hyperparameters (including number of epochs), what learning rate gives you the best
results?

4. What is the best result you can get by optimizing over all the hyperparameters (learning rate,
number of epochs, number of hidden layers, number of hidden units per layer) jointly?

5. Describe why it is much more challenging to deal with multiple hyperparameters.

6. What is the smartest strategy you can think of for structuring a search over multiple
hyperparameters?


Discussions: https://discuss.d2l.ai/t/92


4.3 Concise Implementation of Multilayer Perceptrons


As you might expect, by relying on the high-level APIs, we can implement MLPs even more
concisely.


from d2l import mxnet as d2l
from mxnet import gluon, init, npx
from mxnet.gluon import nn
npx.set_np()


4.3.1 Model


As compared with our concise implementation of softmax regression (Section
3.7), the only difference is that we add two fully-connected layers (previously, we added one). The
first is our hidden layer, which contains 256 hidden units and applies the ReLU activation function. 
The second is our output layer. 


net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'),
        nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))


The training loop is exactly the same as when we implemented softmax regression. This modu- 
larity enables us to separate matters concerning the model architecture from orthogonal consid- 
erations. 


batch_size, lr, num_epochs = 256, 0.1, 10
loss = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})

train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)


[Figure: training curves; legend: train loss, train acc, test acc]


Summary


• Using high-level APIs, we can implement MLPs much more concisely.

• For the same classification problem, the implementation of an MLP is the same as that of
softmax regression except for additional hidden layers with activation functions.


Exercises 


1. Try adding different numbers of hidden layers (you may also modify the learning rate). What 
setting works best? 


2. Try out different activation functions. Which one works best? 
3. Try different schemes for initializing the weights. What method works best? 


Discussions: https://discuss.d2l.ai/t/94


4.4 Model Selection, Underfitting, and Overfitting 


As machine learning scientists, our goal is to discover patterns. But how can we be sure that we 
have truly discovered a general pattern and not simply memorized our data? For example, imagine 
that we wanted to hunt for patterns among genetic markers linking patients to their dementia 
status, where the labels are drawn from the set {dementia, mild cognitive impairment, healthy}. 
Because each person's genes identify them uniquely (ignoring identical siblings), it is possible to 
memorize the entire dataset. 


We do not want our model to say “That’s Bob! I remember him! He has dementia!” The reason why 
is simple. When we deploy the model in the future, we will encounter patients that the model has 
never seen before. Our predictions will only be useful if our model has truly discovered a general 
pattern. 


To recapitulate more formally, our goal is to discover patterns that capture regularities in the un- 
derlying population from which our training set was drawn. If we are successful in this endeavor, 
then we could successfully assess risk even for individuals that we have never encountered before. 
This problem—how to discover patterns that generalize—is the fundamental problem of machine 
learning. 


The danger is that when we train models, we access just a small sample of data. The largest public 
image datasets contain roughly one million images. More often, we must learn from only thou- 
sands or tens of thousands of data examples. In a large hospital system, we might access hundreds 
of thousands of medical records. When working with finite samples, we run the risk that we might 
discover apparent associations that turn out not to hold up when we collect more data. 


The phenomenon of fitting our training data more closely than we fit the underlying distribution 
is called overfitting, and the techniques used to combat overfitting are called regularization. In 
the previous sections, you might have observed this effect while experimenting with the Fashion- 
MNIST dataset. If you altered the model structure or the hyperparameters during the experiment, 
you might have noticed that with enough neurons, layers, and training epochs, the model can 
eventually reach perfect accuracy on the training set, even as the accuracy on test data deterio- 
rates. 





6* https://discuss.d2l.ai/t/94


4.4.1 Training Error and Generalization Error


In order to discuss this phenomenon more formally, we need to differentiate between training 
error and generalization error. The training error is the error of our model as calculated on the 
training dataset, while generalization error is the expectation of our model's error were we to apply 
it to an infinite stream of additional data examples drawn from the same underlying data distri- 
bution as our original sample. 


Problematically, we can never calculate the generalization error exactly. That is because the 
stream of infinite data is an imaginary object. In practice, we must estimate the generalization 
error by applying our model to an independent test set constituted of a random selection of data 
examples that were withheld from our training set. 


The following three thought experiments will help illustrate this situation better. Consider a
college student trying to prepare for their final exam. A diligent student will strive to practice well
and test their abilities using exams from previous years. Nonetheless, doing well on past exams is
no guarantee that they will excel when it matters. For instance, the student might try to prepare by
rote learning the answers to the exam questions. This requires the student to memorize many things.
They might even remember the answers for past exams perfectly. Another student might prepare
by trying to understand the reasons for giving certain answers. In most cases, the latter student
will do much better.


Likewise, consider a model that simply uses a lookup table to answer questions. If the set of
allowable inputs is discrete and reasonably small, then perhaps after viewing many training examples,
this approach would perform well. Still, this model has no ability to do better than random guessing
when faced with examples that it has never seen before. In reality, the input spaces are far too
large to memorize the answers corresponding to every conceivable input. For example, consider
the black and white $28 \times 28$ images. If each pixel can take one among 256 grayscale values, then
there are $256^{784}$ possible images. That means that there are far more low-resolution grayscale
thumbnail-sized images than there are atoms in the universe. Even if we could encounter such 
data, we could never afford to store the lookup table. 
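To make the size of this number concrete, the following sketch uses Python's arbitrary-precision integers to count the decimal digits of $256^{784}$. The figure of roughly $10^{80}$ atoms in the observable universe is a commonly quoted order-of-magnitude estimate and is an assumption here, not something derived in the text.

```python
num_pixels = 28 * 28
num_images = 256 ** num_pixels  # Python integers have arbitrary precision
num_digits = len(str(num_images))

# Assumption: about 10^80 atoms in the observable universe (a rough estimate)
atoms_estimate = 10 ** 80

print(num_digits)
print(num_images > atoms_estimate)
```

Even a lookup table with a single bit per possible image would need vastly more entries than this estimate of the number of atoms.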


Last, consider the problem of trying to classify the outcomes of coin tosses (class 0: heads, class
1: tails) based on some contextual features that might be available. Suppose that the coin is fair.
No matter what algorithm we come up with, the generalization error will always be $\frac{1}{2}$. However,
for most algorithms, we should expect our training error to be considerably lower, depending on
the luck of the draw, even if we did not have any features! Consider the dataset {0, 1, 1, 1, 0, 1}.
Our feature-less algorithm would have to fall back on always predicting the majority class, which
appears from our limited sample to be 1. In this case, the model that always predicts class 1 will
incur an error of $\frac{1}{3}$, considerably better than our generalization error. As we increase the amount
of data, the probability that the fraction of heads will deviate significantly from $\frac{1}{2}$ diminishes, and
our training error would come to match the generalization error.
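The coin-toss argument can be checked with a small simulation. This is a minimal sketch using only the standard library; the helper names, the seed, and the sample sizes are arbitrary choices made for illustration.

```python
import random


def majority_training_error(labels):
    """Training error of the feature-less model that predicts the majority class."""
    ones = sum(labels)
    majority_count = max(ones, len(labels) - ones)
    return 1 - majority_count / len(labels)


# The six-toss dataset from the text: the majority class is 1
example_error = majority_training_error([0, 1, 1, 1, 0, 1])
print(example_error)

rng = random.Random(0)  # Fixed seed, chosen arbitrarily for reproducibility


def mean_training_error(n, trials=200):
    """Average training error over many simulated datasets of n fair tosses."""
    errors = [majority_training_error([rng.randint(0, 1) for _ in range(n)])
              for _ in range(trials)]
    return sum(errors) / trials


for n in (6, 100, 2000):
    print(n, mean_training_error(n))
```

The majority-class predictor can never have a training error above 1/2, and the average should drift toward 1/2 as the number of tosses grows, in line with the argument above.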


Statistical Learning Theory


Since generalization is the fundamental problem in machine learning, you might not be surprised
to learn that many mathematicians and theorists have dedicated their lives to developing formal
theories to describe this phenomenon. In their eponymous theorem
(https://en.wikipedia.org/wiki/Glivenko%E2%80%93Cantelli_theorem), Glivenko and Cantelli
derived the rate at which the training error converges to the generalization error. In a series of
seminal papers, Vapnik and Chervonenkis
(https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_theory) extended this theory to
more general classes of functions. This work laid the foundations of statistical learning theory.


In the standard supervised learning setting, which we have addressed up until now and will stick 
with throughout most of this book, we assume that both the training data and the test data are 
drawn independently from identical distributions. This is commonly called the i.i.d. assumption, 
which means that the process that samples our data has no memory. In other words, the second 
example drawn and the third drawn are no more correlated than the second and the two-millionth 
sample drawn. 


Being a good machine learning scientist requires thinking critically, and already you should be 
poking holes in this assumption, coming up with common cases where the assumption fails. What 
if we train a mortality risk predictor on data collected from patients at UCSF Medical Center, and 
apply it on patients at Massachusetts General Hospital? These distributions are simply not identi- 
cal. Moreover, draws might be correlated in time. What if we are classifying the topics of Tweets? 
The news cycle would create temporal dependencies in the topics being discussed, violating any 
assumptions of independence. 


Sometimes we can get away with minor violations of the i.i.d. assumption and our models will 
continue to work remarkably well. After all, nearly every real-world application involves at least 
some minor violation of the i.i.d. assumption, and yet we have many useful tools for various ap- 
plications such as face recognition, speech recognition, and language translation. 


Other violations are sure to cause trouble. Imagine, for example, if we try to train a face recog- 
nition system by training it exclusively on university students and then want to deploy it as a tool 
for monitoring geriatrics in a nursing home population. This is unlikely to work well since college 
students tend to look considerably different from the elderly. 


In subsequent chapters, we will discuss problems arising from violations of the i.i.d. assump- 
tion. For now, even taking the i.i.d. assumption for granted, understanding generalization is a 
formidable problem. Moreover, elucidating the precise theoretical foundations that might ex- 
plain why deep neural networks generalize as well as they do continues to vex the greatest minds 
in learning theory. 


When we train our models, we attempt to search for a function that fits the training data as well as 
possible. If the function is so flexible that it can catch on to spurious patterns just as easily as to 
true associations, then it might perform too well without producing a model that generalizes well 
to unseen data. This is precisely what we want to avoid or at least control. Many of the techniques 
in deep learning are heuristics and tricks aimed at guarding against overfitting. 





6 https://en.wikipedia.org/wiki/Glivenko%E2%80%93Cantelli_theorem
& https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_theory


Model Complexity


When we have simple models and abundant data, we expect the generalization error to resemble 
the training error. When we work with more complex models and fewer examples, we expect the 
training error to go down but the generalization gap to grow. What precisely constitutes model 
complexity is a complex matter. Many factors govern whether a model will generalize well. For 
example a model with more parameters might be considered more complex. A model whose 
parameters can take a wider range of values might be more complex. Often with neural networks, 
we think of a model that takes more training iterations as more complex, and one subject to early 
stopping (fewer training iterations) as less complex. 


It can be difficult to compare the complexity among members of substantially different model 
classes (say, decision trees vs. neural networks). For now, a simple rule of thumb is quite useful: 
a model that can readily explain arbitrary facts is what statisticians view as complex, whereas one 
that has only a limited expressive power but still manages to explain the data well is probably 
closer to the truth. In philosophy, this is closely related to Popper's criterion of falsifiability of a 
scientific theory: a theory is good if it fits data and if there are specific tests that can be used to 
disprove it. This is important since all statistical estimation is post hoc, i.e., we estimate after we 
observe the facts, hence vulnerable to the associated fallacy. For now, we will put the philosophy 
aside and stick to more tangible issues. 


In this section, to give you some intuition, we will focus on a few factors that tend to influence the 
generalizability of a model class: 


1. The number of tunable parameters. When the number of tunable parameters, sometimes 
called the degrees of freedom, is large, models tend to be more susceptible to overfitting. 


2. The values taken by the parameters. When weights can take a wider range of values, models 
can be more susceptible to overfitting. 


3. The number of training examples. It is trivially easy to overfit a dataset containing only 
one or two examples even if your model is simple. But overfitting a dataset with millions of 
examples requires an extremely flexible model. 


4.4.2 Model Selection 


In machine learning, we usually select our final model after evaluating several candidate models. 
This process is called model selection. Sometimes the models subject to comparison are fundamen- 
tally different in nature (say, decision trees vs. linear models). At other times, we are comparing 
members of the same class of models that have been trained with different hyperparameter set- 
tings. 


With MLPs, for example, we may wish to compare models with different numbers of hidden layers, 
different numbers of hidden units, and various choices of the activation functions applied to each 
hidden layer. In order to determine the best among our candidate models, we will typically employ 
a validation dataset.


Validation Dataset


In principle we should not touch our test set until after we have chosen all our hyperparameters. 
Were we to use the test data in the model selection process, there is a risk that we might overfit 
the test data. Then we would be in serious trouble. If we overfit our training data, there is always 
the evaluation on test data to keep us honest. But if we overfit the test data, how would we ever 
know? 


Thus, we should never rely on the test data for model selection. And yet we cannot rely solely on 
the training data for model selection either because we cannot estimate the generalization error 
on the very data that we use to train the model. 


In practical applications, the picture gets muddier. While ideally we would only touch the test 
data once, to assess the very best model or to compare a small number of models to each other, 
real-world test data is seldom discarded after just one use. We can seldom afford a new test set for 
each round of experiments. 


The common practice to address this problem is to split our data three ways, incorporating a vali- 
dation dataset (or validation set) in addition to the training and test datasets. The result is a murky 
practice where the boundaries between validation and test data are worryingly ambiguous. Un- 
less explicitly stated otherwise, in the experiments in this book we are really working with what 
should rightly be called training data and validation data, with no true test sets. Therefore, the 
accuracy reported in each experiment of the book is really the validation accuracy and not a true 
test set accuracy. 
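A three-way split can be sketched in a few lines. The function below is a hypothetical helper (not part of d2l), and the 80/10/10 proportions are an arbitrary assumption; the only point is that the three index sets are disjoint and together cover the data.

```python
import random


def three_way_split(num_examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle example indices and split them into train/validation/test."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    n_val = int(num_examples * val_fraction)
    n_test = int(num_examples * test_fraction)
    val_idx = indices[:n_val]
    test_idx = indices[n_val:n_val + n_test]
    train_idx = indices[n_val + n_test:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Hyperparameters would then be chosen using only the validation indices, with the test indices consulted as rarely as possible.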


K-Fold Cross-Validation 


When training data is scarce, we might not even be able to afford to hold out enough data to con- 
stitute a proper validation set. One popular solution to this problem is to employ K-fold cross- 
validation. Here, the original training data is split into K non-overlapping subsets. Then model 
training and validation are executed K times, each time training on $K - 1$ subsets and validating
on a different subset (the one not used for training in that round). Finally, the training and
validation errors are estimated by averaging over the results from the K experiments. 
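The procedure just described might be sketched as follows. `k_fold_indices` and `k_fold_average` are hypothetical helpers, and `fit_and_score` stands for any user-supplied function that trains on the first index list and returns a (training error, validation error) pair.

```python
def k_fold_indices(num_examples, k):
    """Yield (train_indices, valid_indices) for each of the k folds."""
    indices = list(range(num_examples))
    # Spread any remainder over the first folds so that sizes differ by at most 1
    fold_sizes = [num_examples // k + (1 if i < num_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


def k_fold_average(num_examples, k, fit_and_score):
    """Average the (train_error, valid_error) pairs over the k rounds."""
    results = [fit_and_score(train, valid)
               for train, valid in k_fold_indices(num_examples, k)]
    train_err = sum(r[0] for r in results) / k
    valid_err = sum(r[1] for r in results) / k
    return train_err, valid_err


folds = list(k_fold_indices(10, 3))
for train, valid in folds:
    print(train, valid)
```

In practice the data would usually be shuffled before being cut into folds; that step is left out here to keep the index bookkeeping easy to follow.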


4.4.3 Underfitting or Overfitting? 


When we compare the training and validation errors, we want to be mindful of two common
situations. First, we want to watch out for cases when our training error and validation error are both
substantial but there is little gap between them. If the model is unable to reduce the training
error, that could mean that our model is too simple (i.e., insufficiently expressive) to capture the 
pattern that we are trying to model. Moreover, since the generalization gap between our train- 
ing and validation errors is small, we have reason to believe that we could get away with a more 
complex model. This phenomenon is known as underfitting. 


On the other hand, as we discussed above, we want to watch out for the cases when our train- 
ing error is significantly lower than our validation error, indicating severe overfitting. Note that 
overfitting is not always a bad thing. With deep learning especially, it is well known that the best 
predictive models often perform far better on training data than on holdout data. Ultimately, we 
usually care more about the validation error than about the gap between the training and valida- 
tion errors.


Whether we overfit or underfit can depend both on the complexity of our model and the size of
the available training datasets, two topics that we discuss below. 


Model Complexity 


To illustrate some classical intuition about overfitting and model complexity, we give an example 
using polynomials. Given training data consisting of a single feature x and a corresponding real- 
valued label y, we try to find the polynomial of degree d 


$$\hat{y} = \sum_{i=0}^{d} x^i w_i \qquad (4.4.1)$$


to estimate the labels y. This is just a linear regression problem where our features are given by
the powers of x, the model's weights are given by $w_i$, and the bias is given by $w_0$ since $x^0 = 1$ for all
x. Since this is just a linear regression problem, we can use the squared error as our loss function.


A higher-order polynomial function is more complex than a lower-order polynomial function,
since the higher-order polynomial has more parameters and can represent a wider range of
functions. Fixing the training dataset, higher-order polynomial functions should always achieve
lower (at worst, equal) training error relative to lower degree polynomials. In fact, whenever the
data examples each have a distinct value of x, a polynomial function with degree equal to the 
number of data examples can fit the training set perfectly. We visualize the relationship between 
polynomial degree and underfitting vs. overfitting in Fig. 4.4.1. 
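The claim about perfect fits can be illustrated without any training at all. The sketch below builds the Lagrange interpolating polynomial, which has degree at most n − 1 for n examples with distinct x and already passes through every training point, so any higher degree can do so as well. The toy data are made up for illustration.

```python
def lagrange_interpolant(xs, ys):
    """Return the polynomial of degree at most len(xs) - 1 through all points."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return poly


# Hypothetical toy data: distinct x values with arbitrary labels
xs = [-1.0, 0.0, 0.5, 2.0]
ys = [3.0, -1.0, 4.0, 0.5]
fit = lagrange_interpolant(xs, ys)
training_error = sum((fit(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(training_error)
```

A zero training error here says nothing about how the polynomial behaves between or beyond the training points, which is exactly the overfitting concern.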


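[Figure: loss (vertical axis) versus model complexity (horizontal axis), showing the training loss
and generalization loss curves; the regions are labeled, from left to right, underfitting, optimum,
and overfitting.]


Fig. 4.4.1: Influence of model complexity on underfitting and overfitting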


Dataset Size 


The other big consideration to bear in mind is the dataset size. Fixing our model, the fewer sam- 
ples we have in the training dataset, the more likely (and more severely) we are to encounter over- 
fitting. As we increase the amount of training data, the generalization error typically decreases. 
Moreover, in general, more data never hurt. For a fixed task and data distribution, there is typi- 
cally a relationship between model complexity and dataset size. Given more data, we might prof- 
itably attempt to fit a more complex model. Absent sufficient data, simpler models may be more
difficult to beat. For many tasks, deep learning only outperforms linear models when many thou-
sands of training examples are available. In part, the current success of deep learning owes to 
the current abundance of massive datasets due to Internet companies, cheap storage, connected 
devices, and the broad digitization of the economy. 


4.4.4 Polynomial Regression 


We can now explore these concepts interactively by fitting polynomials to data. 


from d2l import mxnet as d2l
from mxnet import gluon, np, npx
from mxnet.gluon import nn
import math

npx.set_np()


Generating the Dataset 


First we need data. Given x, we will use the following cubic polynomial to generate the labels on 
training and test data: 


$$y = 5 + 1.2x - 3.4\frac{x^2}{2!} + 5.6\frac{x^3}{3!} + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, 0.1^2). \qquad (4.4.2)$$


The noise term $\epsilon$ obeys a normal distribution with a mean of 0 and a standard deviation of 0.1. For
optimization, we typically want to avoid very large values of gradients or losses. This is why the
features are rescaled from $x^i$ to $\frac{x^i}{i!}$. It allows us to avoid very large values for large exponents $i$. We
will synthesize 100 samples each for the training set and test set.





max_degree = 20  # Maximum degree of the polynomial
n_train, n_test = 100, 100  # Training and test dataset sizes
true_w = np.zeros(max_degree)  # Allocate lots of empty space
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])

features = np.random.normal(size=(n_train + n_test, 1))
np.random.shuffle(features)
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
for i in range(max_degree):
    poly_features[:, i] /= math.gamma(i + 1)  # `gamma(n)` = (n-1)!
# Shape of `labels`: (`n_train` + `n_test`,)
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)


Again, monomials stored in poly_features are rescaled by the gamma function, where $\Gamma(n) = (n - 1)!$.
Take a look at the first 2 samples from the generated dataset. The value 1 is technically a
feature, namely the constant feature corresponding to the bias. 


features[:2], poly_features[:2, :], labels[:2] 


(array([[-0.03716067],
        [-1.1468065 ]]),
 array([[ 1.0000000e+00, -3.7160669e-02,  6.9045764e-04, -8.5526226e-06, ...],
        [ 1.0000000e+00, -1.1468065e+00,  6.5758252e-01, -2.5137332e-01, ...]]),
 array([ 5.1432443 , -0.06415121]))
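The printed values can be cross-checked with the standard library alone. The sketch below confirms that `math.gamma(n)` agrees with (n − 1)! for small n and recomputes the first four rescaled features of the first sample from the output; the comparison with $10^{19}$ is an extra illustration, with an arbitrarily chosen input of 10, of why the rescaling keeps magnitudes manageable.

```python
import math

# math.gamma(n) agrees with (n - 1)! for positive integers n
gamma_matches = all(math.isclose(math.gamma(n), math.factorial(n - 1))
                    for n in range(1, 11))

x = -0.03716067  # First feature value from the printed output
rescaled = [x ** i / math.gamma(i + 1) for i in range(4)]
print(gamma_matches, rescaled)

# For a larger input, the raw power is huge but the rescaled one is modest
raw_power = 10.0 ** 19
rescaled_power = raw_power / math.gamma(20)
print(raw_power, rescaled_power)
```

Small differences in the last digits relative to the printed array are expected, since the original computation ran in 32-bit floating point.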


Training and Testing the Model


Let us first implement a function to evaluate the loss on a given dataset. 


def evaluate_loss(net, data_iter, loss):  #@save
    """Evaluate the loss of a model on the given dataset."""
    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
    for X, y in data_iter:
        l = loss(net(X), y)
        metric.add(l.sum(), l.size)
    return metric[0] / metric[1]


Now define the training function. 


def train(train_features, test_features, train_labels, test_labels,
          num_epochs=400):
    loss = gluon.loss.L2Loss()
    net = nn.Sequential()
    # Switch off the bias since we already catered for it in the polynomial
    # features
    net.add(nn.Dense(1, use_bias=False))
    net.initialize()
    batch_size = min(10, train_labels.shape[0])
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    test_iter = d2l.load_array((test_features, test_labels), batch_size,
                               is_train=False)
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': 0.01})
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
                            legend=['train', 'test'])
    for epoch in range(num_epochs):
        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
        if epoch == 0 or (epoch + 1) % 20 == 0:
            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
                                     evaluate_loss(net, test_iter, loss)))
    print('weight:', net[0].weight.data().asnumpy())


Third-Order Polynomial Function Fitting (Normal)


We will begin by first using a third-order polynomial function, which is the same order as that
of the data generation function. The results show that this model's training and test losses can
both be effectively reduced. The learned model parameters are also close to the true values
$w = [5, 1.2, -3.4, 5.6]$.


# Pick the first four dimensions, i.e., 1, x, x^2/2!, x^3/3! from the
# polynomial features
train(poly_features[:n_train, :4], poly_features[n_train:, :4],
      labels[:n_train], labels[n_train:])


weight: [[ 5.0191875 1.2220242 -3.4236171 5.5718174]]


Linear Function Fitting (Underfitting)


Let us take another look at linear function fitting. After the decline in early epochs, it becomes 
difficult to further decrease this model's training loss. After the last epoch iteration has been 
completed, the training loss is still high. When used to fit nonlinear patterns (like the third-order 
polynomial function here) linear models are liable to underfit. 


# Pick the first two dimensions, i.e., 1, x, from the polynomial features 


train(poly_features[:n_train, :2], poly_features[n_train:, :2],
      labels[:n_train], labels[n_train:])


weight: [[2.6977625 4.236942 ]]


Higher-Order Polynomial Function Fitting (Overfitting)


Now let us try to train the model using a polynomial of too high degree. Here, there are insufficient 
data to learn that the higher-degree coefficients should have values close to zero. As a result, our 
overly-complex model is so susceptible that it is being influenced by noise in the training data. 
Though the training loss can be effectively reduced, the test loss is still much higher. It shows that 
the complex model overfits the data. 


# Pick all the dimensions from the polynomial features 


train(poly_features[:n_train, :], poly_features[n_train:, :],
      labels[:n_train], labels[n_train:], num_epochs=1500)


weight: [[ 4.9921093 1.3059008 -3.3530357 5.116468  -0.11154182 1.3030001 
0.1267308  0.16649957 0.05129375 -0.02275844 0.00806225 -0.05167888 
-0.02426308 -0.01502205 -0.04941351 0.06389864 -0.04761846 -0.04380165 
-0.05188227 0.05655775]]


In the subsequent sections, we will continue to discuss overfitting problems and methods for
dealing with them, such as weight decay and dropout.


Summary


• Since the generalization error cannot be estimated based on the training error, simply
minimizing the training error will not necessarily mean a reduction in the generalization error.
Machine learning models need to be careful to safeguard against overfitting so as to
minimize the generalization error.


• A validation set can be used for model selection, provided that it is not used too liberally.


• Underfitting means that a model is not able to reduce the training error. When training error
is much lower than validation error, there is overfitting.


• We should choose an appropriately complex model and avoid using insufficient training
samples.


Exercises 


1. Can you solve the polynomial regression problem exactly? Hint: use linear algebra.


2. Consider model selection for polynomials:


    1. Plot the training loss vs. model complexity (degree of the polynomial). What do you
       observe? What degree of polynomial do you need to reduce the training loss to 0?


    2. Plot the test loss in this case.


    3. Generate the same plot as a function of the amount of data.


3. What happens if you drop the normalization (1/i!) of the polynomial features $x^i$? Can you
fix this in some other way?


4. Can you ever expect to see zero generalization error?


Discussions: https://discuss.d2l.ai/t/96


4.5 Weight Decay 


Now that we have characterized the problem of overfitting, we can introduce some standard tech- 
niques for regularizing models. Recall that we can always mitigate overfitting by going out and 
collecting more training data. That can be costly, time consuming, or entirely out of our control, 
making it impossible in the short run. For now, we can assume that we already have as much 
high-quality data as our resources permit and focus on regularization techniques. 


Recall that in our polynomial regression example (Section 4.4) we could limit our model's capacity 
simply by tweaking the degree of the fitted polynomial. Indeed, limiting the number of features 
is a popular technique to mitigate overfitting. However, simply tossing aside features can be too 
blunt an instrument for the job. Sticking with the polynomial regression example, consider what 
might happen with high-dimensional inputs. The natural extensions of polynomials to multivari- 
ate data are called monomials, which are simply products of powers of variables. The degree of a 
monomial is the sum of the powers. For example, $x_1^2 x_2$ and $x_3 x_5^2$ are both monomials of degree
3.


6 https://discuss.d2l.ai/t/96


Note that the number of terms with degree d blows up rapidly as d grows larger. Given k variables,
the number of monomials of degree d (i.e., k multichoose d) is $\binom{k - 1 + d}{k - 1}$. Even small changes in
degree, say from 2 to 3, dramatically increase the complexity of our model. Thus we often need a 
more fine-grained tool for adjusting function complexity. 
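The count can be verified by brute force. The sketch below compares the closed-form expression with an explicit enumeration; the helper names are hypothetical and the choice of k = 20 variables is arbitrary.

```python
import math
from itertools import combinations_with_replacement


def num_monomials(k, d):
    """Number of monomials of degree exactly d in k variables."""
    return math.comb(k - 1 + d, k - 1)


def enumerate_monomials(k, d):
    """List each monomial as a tuple of d variable indices (with repetition)."""
    return list(combinations_with_replacement(range(k), d))


for d in (2, 3):
    print(d, num_monomials(20, d), len(enumerate_monomials(20, d)))
```

Each tuple such as `(0, 0, 1)` stands for the product of the listed variables, so repeated indices encode powers.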


4.5.1 Norms and Weight Decay 


We have described both the $L_2$ norm and the $L_1$ norm, which are special cases of the more general
$L_p$ norm in Section 2.3.10. Weight decay (commonly called $L_2$ regularization) might be the most
widely-used technique for regularizing parametric machine learning models. The technique is 
motivated by the basic intuition that among all functions f, the function f = 0 (assigning the 
value 0 to all inputs) is in some sense the simplest, and that we can measure the complexity of a 
function by its distance from zero. But how precisely should we measure the distance between 
a function and zero? There is no single right answer. In fact, entire branches of mathematics, 
including parts of functional analysis and the theory of Banach spaces, are devoted to answering 
this issue. 


One simple interpretation might be to measure the complexity of a linear function $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$
by some norm of its weight vector, e.g., $\|\mathbf{w}\|^2$. The most common method for ensuring a small
weight vector is to add its norm as a penalty term to the problem of minimizing the loss. Thus we
replace our original objective, minimizing the prediction loss on the training labels, with a new
objective, minimizing the sum of the prediction loss and the penalty term. Now, if our weight vector
grows too large, our learning algorithm might focus on minimizing the weight norm $\|\mathbf{w}\|^2$ vs.
minimizing the training error. That is exactly what we want. To illustrate things in code, let us revive our
previous example from Section 3.1 for linear regression. There, our loss was given by


$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2. \qquad (4.5.1)$$


Recall that $\mathbf{x}^{(i)}$ are the features, $y^{(i)}$ are labels for all data examples $i$, and $(\mathbf{w}, b)$ are the weight
and bias parameters, respectively. To penalize the size of the weight vector, we must somehow
add $\|\mathbf{w}\|^2$ to the loss function, but how should the model trade off the standard loss for this new
additive penalty? In practice, we characterize this tradeoff via the regularization constant $\lambda$, a
non-negative hyperparameter that we fit using validation data:


$$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2 \qquad (4.5.2)$$


For $\lambda = 0$, we recover our original loss function. For $\lambda > 0$, we restrict the size of $\|\mathbf{w}\|$. We divide
by 2 by convention: when we take the derivative of a quadratic function, the 2 and 1/2 cancel out,
ensuring that the expression for the update looks nice and simple. The astute reader might wonder
why we work with the squared norm and not the standard norm (i.e., the Euclidean distance). We
do this for computational convenience. By squaring the $L_2$ norm, we remove the square root,
leaving the sum of squares of each component of the weight vector. This makes the derivative of
the penalty easy to compute: the sum of derivatives equals the derivative of the sum.
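To see the cancellation of the 2 and the 1/2 concretely, the following sketch compares the analytic gradient of the penalty, $\lambda \mathbf{w}$, with a finite-difference estimate. The weights and the value of $\lambda$ are made-up numbers, and the helper names are hypothetical.

```python
def squared_l2_penalty(w, lambd):
    """Penalty term (lambda / 2) * ||w||^2."""
    return lambd / 2 * sum(wi ** 2 for wi in w)


def penalty_gradient(w, lambd):
    """Gradient of the penalty: the 2 from the square cancels the 1/2."""
    return [lambd * wi for wi in w]


def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, used here only as a sanity check."""
    grad = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + eps] + w[i + 1:]
        down = w[:i] + [w[i] - eps] + w[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


w = [0.5, -2.0, 3.0]  # Hypothetical weights
lambd = 3.0           # Hypothetical regularization constant
analytic = penalty_gradient(w, lambd)
numeric = numerical_gradient(lambda v: squared_l2_penalty(v, lambd), w)
print(analytic, numeric)
```

Because the penalty is a sum over components, each partial derivative depends only on its own weight, which is what keeps the update rule simple.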


Moreover, you might ask why we work with the $L_2$ norm in the first place and not, say, the $L_1$
norm. In fact, other choices are valid and popular throughout statistics. While $L_2$-regularized
linear models constitute the classic ridge regression algorithm, $L_1$-regularized linear regression is
a similarly fundamental model in statistics, which is popularly known as lasso regression.


One reason to work with the $L_2$ norm is that it places an outsize penalty on large components of the
weight vector. This biases our learning algorithm towards models that distribute weight evenly
across a larger number of features. In practice, this might make them more robust to measurement
error in a single variable. By contrast, $L_1$ penalties lead to models that concentrate weights
on a small set of features by clearing the other weights to zero. This is called feature selection,
which may be desirable for other reasons.
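
To make the contrast concrete, here is a minimal sketch in plain Python. The two weight vectors and the helper names `l1_penalty` and `squared_l2_penalty` are illustrative choices of ours, not part of the book's code. Both vectors have the same L1 norm, yet the squared L2 penalty is four times larger for the one that puts all of its weight on a single feature.

```python
# Two hypothetical weight vectors with the same L1 norm (illustrative values).
concentrated = [2.0, 0.0, 0.0, 0.0]  # all weight on one feature
spread = [0.5, 0.5, 0.5, 0.5]        # weight shared evenly


def l1_penalty(w):
    """Sum of the absolute values of the components."""
    return sum(abs(wi) for wi in w)


def squared_l2_penalty(w):
    """Sum of the squared components (no square root)."""
    return sum(wi ** 2 for wi in w)


for name, w in [("concentrated", concentrated), ("spread", spread)]:
    print(name, "L1:", l1_penalty(w), "squared L2:", squared_l2_penalty(w))
```

Under the L1 penalty the two vectors cost the same, so nothing discourages the concentrated solution; under the squared L2 penalty the evenly spread vector is cheaper.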


Using the same notation in (3.1.10), the minibatch stochastic gradient descent updates for
$L_2$-regularized regression follow:





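$$\mathbf{w} \leftarrow (1 - \eta\lambda)\,\mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \tag{4.5.3}$$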


As before, we update w based on the amount by which our estimate differs from the observation. 
However, we also shrink the size of w towards zero. That is why the method is sometimes called 
“weight decay”: given the penalty term alone, our optimization algorithm decays the weight at each 
step of training. In contrast to feature selection, weight decay offers us a continuous mechanism 
for adjusting the complexity of a function. Smaller values of $\lambda$ correspond to less constrained $\mathbf{w}$,
whereas larger values of $\lambda$ constrain $\mathbf{w}$ more considerably.
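
The decay behavior can be checked with a short sketch in plain Python. The function name `sgd_step_with_weight_decay` and the numbers below are our own illustrative assumptions; `data_grad` stands for the minibatch-averaged gradient of the unregularized loss. When that gradient is zero, only the penalty acts, and the weights shrink by the factor (1 - lr * lambd) at every step.

```python
def sgd_step_with_weight_decay(w, data_grad, lr, lambd):
    """Return (1 - lr * lambd) * w - lr * data_grad, componentwise."""
    return [(1 - lr * lambd) * wi - lr * gi for wi, gi in zip(w, data_grad)]


def norm(w):
    """Euclidean (L2) norm of a list of numbers."""
    return sum(wi ** 2 for wi in w) ** 0.5


# With a zero data gradient the penalty alone drives the update, so the
# norm of w shrinks geometrically.
w = [1.0, -2.0, 0.5]
lr, lambd = 0.1, 3.0
for step in range(5):
    w = sgd_step_with_weight_decay(w, [0.0, 0.0, 0.0], lr, lambd)
    print(step + 1, norm(w))
```

The same function reproduces an ordinary gradient step on the penalized loss, since w - lr * (g + lambd * w) equals (1 - lr * lambd) * w - lr * g.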


Whether we include a corresponding bias penalty $b^2$ can vary across implementations, and may
vary across layers of a neural network. Often, we do not regularize the bias term of a network’s 
output layer. 


4.5.2 High-Dimensional Linear Regression 


We can illustrate the benefits of weight decay through a simple synthetic example. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn

npx.set_np()


First, we generate some data as before:

$$y = 0.05 + \sum_{i=1}^d 0.01 x_i + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, 0.01^2). \tag{4.5.4}$$


We choose our label to be a linear function of our inputs, corrupted by Gaussian noise with zero 
mean and standard deviation 0.01. To make the effects of overfitting pronounced, we can increase 
the dimensionality of our problem to d = 200 and work with a small training set containing only 
20 examples. 
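
For readers who want to see the data-generating process without the helper library, the following plain-Python sketch draws one such dataset. `make_synthetic_data` is a hypothetical stand-in of ours, not the `d2l.synthetic_data` helper, and drawing the features from a standard normal distribution is an assumption on our part.

```python
import random


def make_synthetic_data(true_w, true_b, num_examples, noise_std=0.01, seed=0):
    """Features ~ N(0, 1) (assumed); labels y = w.x + b + Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    X, y = [], []
    for _ in range(num_examples):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        clean = sum(wi * xi for wi, xi in zip(true_w, x)) + true_b
        X.append(x)
        y.append(clean + rng.gauss(0.0, noise_std))
    return X, y


d = 200  # many more features than training examples
X_train, y_train = make_synthetic_data([0.01] * d, 0.05, num_examples=20)
print(len(X_train), len(X_train[0]), len(y_train))
```

With 200 weights and only 20 examples, many different weight vectors fit the training labels exactly, which is what makes overfitting so pronounced in this setting.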


n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
true_w, true_b = np.ones((num_inputs, 1)) * 0.01, 0.05
train_data = d2l.synthetic_data(true_w, true_b, n_train)
train_iter = d2l.load_array(train_data, batch_size)
test_data = d2l.synthetic_data(true_w, true_b, n_test)
test_iter = d2l.load_array(test_data, batch_size, is_train=False)


4.5.3 Implementation from Scratch


In the following, we will implement weight decay from scratch, simply by adding the squared $L_2$
penalty to the original target function.


Initializing Model Parameters 


First, we will define a function to randomly initialize our model parameters. 


def init_params():
    w = np.random.normal(scale=1, size=(num_inputs, 1))
    b = np.zeros(1)
    w.attach_grad()
    b.attach_grad()
    return [w, b]


Defining $L_2$ Norm Penalty


Perhaps the most convenient way to implement this penalty is to square all terms in place and 
sum them up. 


def l2_penalty(w):
    return (w**2).sum() / 2


Defining the Training Loop 


The following code fits a model on the training set and evaluates it on the test set. The linear
network and the squared loss have not changed since Chapter 3, so we will just import them via
d2l.linreg and d2l.squared_loss. The only change here is that our loss now includes the penalty
term.


def train(lambd):
    w, b = init_params()
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    num_epochs, lr = 100, 0.003
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[5, num_epochs], legend=['train', 'test'])
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                # The L2 norm penalty term has been added, and broadcasting
                # makes `l2_penalty(w)` a vector whose length is `batch_size`
                l = loss(net(X), y) + lambd * l2_penalty(w)
            l.backward()
            d2l.sgd([w, b], lr, batch_size)
        if (epoch + 1) % 5 == 0:
            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                     d2l.evaluate_loss(net, test_iter, loss)))
    print('L2 norm of w:', np.linalg.norm(w))


Training without Regularization


We now run this code with lambd = 0, disabling weight decay. Note that we overfit badly, decreasing
the training error but not the test error: a textbook case of overfitting.


train(lambd=0) 


L2 norm of w: 13.259391 


[Figure: train and test loss (log scale) versus epochs]


Using Weight Decay 


Below, we run with substantial weight decay. Note that the training error increases but the test 
error decreases. This is precisely the effect we expect from regularization. 


train(lambd=3) 


L2 norm of w: 0.3824777


[Figure: train and test loss (log scale) versus epochs]


4.5.4 Concise Implementation


Because weight decay is ubiquitous in neural network optimization, the deep learning framework 
makes it especially convenient, integrating weight decay into the optimization algorithm itself for 
easy use in combination with any loss function. Moreover, this integration serves a computational 
benefit, allowing implementation tricks to add weight decay to the algorithm, without any addi- 
tional computational overhead. Since the weight decay portion of the update depends only on the 
current value of each parameter, the optimizer must touch each parameter once anyway. 


In the following code, we specify the weight decay hyperparameter directly through wd when in- 
stantiating our Trainer. By default, Gluon decays both weights and biases simultaneously. Note 
that the hyperparameter wd will be multiplied by wd_mult when updating model parameters. Thus, 
if we set wd_mult to zero, the bias parameter b will not decay. 


def train_concise(wd):
    net = nn.Sequential()
    net.add(nn.Dense(1))
    net.initialize(init.Normal(sigma=1))
    loss = gluon.loss.L2Loss()
    num_epochs, lr = 100, 0.003
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': lr, 'wd': wd})
    # The bias parameter has not decayed. Bias names generally end with "bias"
    net.collect_params('.*bias').setattr('wd_mult', 0)
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[5, num_epochs], legend=['train', 'test'])
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(batch_size)
        if (epoch + 1) % 5 == 0:
            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                     d2l.evaluate_loss(net, test_iter, loss)))
    print('L2 norm of w:', np.linalg.norm(net[0].weight.data()))


The plots look identical to those when we implemented weight decay from scratch. However, they 
run appreciably faster and are easier to implement, a benefit that will become more pronounced 
for larger problems. 


train_concise(0) 


L2 norm of w: 15.01407


train_concise(3)


L2 norm of w: 0.33992025


So far, we only touched upon one notion of what constitutes a simple linear function. Moreover,
what constitutes a simple nonlinear function can be an even more complex question. For instance,
reproducing kernel Hilbert space (RKHS)⁶ allows one to apply tools introduced for linear
functions in a nonlinear context. Unfortunately, RKHS-based algorithms tend to scale poorly
to large, high-dimensional data. In this book we will default to the simple heuristic of applying
weight decay on all layers of a deep network.





6 https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space


Summary


• Regularization is a common method for dealing with overfitting. It adds a penalty term to
  the loss function on the training set to reduce the complexity of the learned model.
• One particular choice for keeping the model simple is weight decay using an $L_2$ penalty. This
  leads to weight decay in the update steps of the learning algorithm.
• The weight decay functionality is provided in optimizers from deep learning frameworks.
• Different sets of parameters can have different update behaviors within the same training
  loop.


Exercises 


1. Experiment with the value of $\lambda$ in the estimation problem in this section. Plot training and
test accuracy as a function of $\lambda$. What do you observe?

2. Use a validation set to find the optimal value of $\lambda$. Is it really the optimal value? Does this
matter?

3. What would the update equations look like if instead of $\|\mathbf{w}\|^2$ we used $\sum_i |w_i|$ as our penalty
of choice ($L_1$ regularization)?

4. We know that $\|\mathbf{w}\|^2 = \mathbf{w}^\top \mathbf{w}$. Can you find a similar equation for matrices (see the Frobenius
norm in Section 2.3.10)?

5. Review the relationship between training error and generalization error. In addition to
weight decay, increased training, and the use of a model of suitable complexity, what other
ways can you think of to deal with overfitting?

6. In Bayesian statistics we use the product of prior and likelihood to arrive at a posterior via
$P(w \mid x) \propto P(x \mid w) P(w)$. How can you identify $P(w)$ with regularization?

Discussions


4.6 Dropout


In Section 4.5, we introduced the classical approach to regularizing statistical models by penalizing
the $L_2$ norm of the weights. In probabilistic terms, we could justify this technique by arguing
that we have assumed a prior belief that weights take values from a Gaussian distribution
with mean zero. More intuitively, we might argue that we encouraged the model to spread out its 
weights among many features rather than depending too much on a small number of potentially 
spurious associations. 





https://discuss.d2l.ai/t/98


4.6.1 Overfitting Revisited


Faced with more features than examples, linear models tend to overfit. But given more examples 
than features, we can generally count on linear models not to overfit. Unfortunately, the reliability
with which linear models generalize comes at a cost. Naively applied, linear models do not take 
into account interactions among features. For every feature, a linear model must assign either a 
positive or a negative weight, ignoring context. 


In traditional texts, this fundamental tension between generalizability and flexibility is described 
as the bias-variance tradeoff. Linear models have high bias: they can only represent a small class 
of functions. However, these models have low variance: they give similar results across different 
random samples of the data. 


Deep neural networks inhabit the opposite end of the bias-variance spectrum. Unlike linear mod- 
els, neural networks are not confined to looking at each feature individually. They can learn in- 
teractions among groups of features. For example, they might infer that “Nigeria” and “Western 
Union” appearing together in an email indicates spam but that separately they do not. 


Even when we have far more examples than features, deep neural networks are capable of over- 
fitting. In 2017, a group of researchers demonstrated the extreme flexibility of neural networks by 
training deep nets on randomly-labeled images. Despite the absence of any true pattern linking 
the inputs to the outputs, they found that the neural network optimized by stochastic gradient descent
could label every image in the training set perfectly. Consider what this means. If the labels
are assigned uniformly at random and there are 10 classes, then no classifier can do better than 
10% accuracy on holdout data. The generalization gap here is a whopping 90%. If our models are 
so expressive that they can overfit this badly, then when should we expect them not to overfit? 


The mathematical foundations for the puzzling generalization properties of deep networks re- 
main open research questions, and we encourage the theoretically-oriented reader to dig deeper 
into the topic. For now, we turn to the investigation of practical tools that tend to empirically 
improve the generalization of deep nets. 


4.6.2 Robustness through Perturbations 


Let us think briefly about what we expect from a good predictive model. We want it to perform well
on unseen data. Classical generalization theory suggests that to close the gap between train and 
test performance, we should aim for a simple model. Simplicity can come in the form of a small 
number of dimensions. We explored this when discussing the monomial basis functions of linear 
models in Section 4.4. Additionally, as we saw when discussing weight decay ($L_2$ regularization)
in Section 4.5, the (inverse) norm of the parameters also represents a useful measure of simplicity. 
Another useful notion of simplicity is smoothness, i.e., that the function should not be sensitive to 
small changes to its inputs. For instance, when we classify images, we would expect that adding 
some random noise to the pixels should be mostly harmless. 


In 1995, Christopher Bishop formalized this idea when he proved that training with input noise is 
equivalent to Tikhonov regularization (Bishop, 1995). This work drew a clear mathematical con- 
nection between the requirement that a function be smooth (and thus simple), and the require- 
ment that it be resilient to perturbations in the input. 


Then, in 2014, Srivastava et al. (Srivastava et al., 2014) developed a clever idea for how to apply 
Bishop’s idea to the internal layers of a network, too. Namely, they proposed to inject noise into 
each layer of the network before calculating the subsequent layer during training. They realized
that when training a deep network with many layers, injecting noise enforces smoothness just on
the input-output mapping.


Their idea, called dropout, involves injecting noise while computing each internal layer during 
forward propagation, and it has become a standard technique for training neural networks. The 
method is called dropout because we literally drop out some neurons during training. Throughout 
training, on each iteration, standard dropout consists of zeroing out some fraction of the nodes in
each layer before calculating the subsequent layer. 


To be clear, we are imposing our own narrative with the link to Bishop. The original paper on 
dropout offers intuition through a surprising analogy to sexual reproduction. The authors argue 
that neural network overfitting is characterized by a state in which each layer relies on a specific
pattern of activations in the previous layer, calling this condition co-adaptation. Dropout, they 
claim, breaks up co-adaptation just as sexual reproduction is argued to break up co-adapted genes. 


The key challenge then is how to inject this noise. One idea is to inject the noise in an unbiased 
manner so that the expected value of each layer, while fixing the others, equals the value it
would have taken absent noise. 


In Bishop's work, he added Gaussian noise to the inputs to a linear model. At each training iteration,
he added noise sampled from a distribution with mean zero $\epsilon \sim \mathcal{N}(0, \sigma^2)$ to the input $\mathbf{x}$,
yielding a perturbed point $\mathbf{x}' = \mathbf{x} + \epsilon$. In expectation, $E[\mathbf{x}'] = \mathbf{x}$.


In standard dropout regularization, one debiases each layer by normalizing by the fraction of 
nodes that were retained (not dropped out). In other words, with dropout probability p, each inter- 
mediate activation h is replaced by a random variable h’ as follows: 


$$h' = \begin{cases} 0 & \text{with probability } p \\ \dfrac{h}{1-p} & \text{otherwise} \end{cases} \tag{4.6.1}$$


By design, the expectation remains unchanged, i.e., $E[h'] = h$.
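
The claim follows from a one-line calculation: the expectation is p times 0 plus (1 - p) times h / (1 - p), which is h. The plain-Python sketch below checks it empirically for a single activation. The helper name `dropout_scalar`, the seed, and the chosen values are illustrative assumptions of ours.

```python
import random


def dropout_scalar(h, p, rng):
    """Return 0 with probability p, otherwise h / (1 - p). Assumes 0 <= p < 1."""
    return 0.0 if rng.random() < p else h / (1.0 - p)


rng = random.Random(0)  # fixed seed so the run is repeatable
h, p, trials = 3.0, 0.5, 100_000
mean = sum(dropout_scalar(h, p, rng) for _ in range(trials)) / trials
print(mean)  # close to h, up to sampling noise
```

Each individual draw is either 0 or 6 here, so a single sample is far from h; only the average over many dropout masks recovers it.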


4.6.3 Dropout in Practice 


Recall the MLP with a hidden layer and 5 hidden units in Fig. 4.1.1. When we apply dropout to 
a hidden layer, zeroing out each hidden unit with probability p, the result can be viewed as a 
network containing only a subset of the original neurons. In Fig. 4.6.1, $h_2$ and $h_5$ are removed.
Consequently, the calculation of the outputs no longer depends on $h_2$ or $h_5$ and their respective
gradient also vanishes when performing backpropagation. In this way, the calculation of the output
layer cannot be overly dependent on any one element of $h_1, \ldots, h_5$.


[Figure panels: Before dropout; After dropout]

Fig. 4.6.1: MLP before and after dropout.


Typically, we disable dropout at test time. Given a trained model and a new example, we do not 
drop out any nodes and thus do not need to normalize. However, there are some exceptions: some 
researchers use dropout at test time as a heuristic for estimating the uncertainty of neural network 
predictions: if the predictions agree across many different dropout masks, then we might say that 
the network is more confident. 


4.6.4 Implementation from Scratch 


To implement the dropout function for a single layer, we must draw as many samples from a 
Bernoulli (binary) random variable as our layer has dimensions, where the random variable takes 
value 1 (keep) with probability $1 - p$ and 0 (drop) with probability $p$. One easy way to implement
this is to first draw samples from the uniform distribution $U[0, 1]$. Then we can keep those nodes
for which the corresponding sample is greater than p, dropping the rest. 


In the following code, we implement a dropout_layer function that drops out the elements in the 
tensor input X with probability dropout, rescaling the remainder as described above: dividing the 
survivors by 1.0-dropout. 


from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn

npx.set_np()

def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    # In this case, all elements are dropped out
    if dropout == 1:
        return np.zeros_like(X)
    # In this case, all elements are kept
    if dropout == 0:
        return X
    mask = np.random.uniform(0, 1, X.shape) > dropout
    return mask.astype(np.float32) * X / (1.0 - dropout)


We can test out the dropout_layer function on a few examples. In the following lines of code, we 
pass our input X through the dropout operation, with probabilities 0, 0.5, and 1, respectively.


X = np.arange(16).reshape(2, 8)
print(dropout_layer(X, 0))
print(dropout_layer(X, 0.5))
print(dropout_layer(X, 1))


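[[ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 8.  9. 10. 11. 12. 13. 14. 15.]]
[second output, for dropout probability 0.5: random; surviving entries are doubled, the rest are 0]
[[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]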


Defining Model Parameters 


Again, we work with the Fashion-MNIST dataset introduced in Section 3.5. We define an MLP with 
two hidden layers containing 256 units each. 


num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256

W1 = np.random.normal(scale=0.01, size=(num_inputs, num_hiddens1))
b1 = np.zeros(num_hiddens1)
W2 = np.random.normal(scale=0.01, size=(num_hiddens1, num_hiddens2))
b2 = np.zeros(num_hiddens2)
W3 = np.random.normal(scale=0.01, size=(num_hiddens2, num_outputs))
b3 = np.zeros(num_outputs)

params = [W1, b1, W2, b2, W3, b3]
for param in params:
    param.attach_grad()


Defining the Model 


The model below applies dropout to the output of each hidden layer (following the activation func- 
tion). We can set dropout probabilities for each layer separately. A common trend is to set a lower 
dropout probability closer to the input layer. Below we set it to 0.2 and 0.5 for the first and second 
hidden layers, respectively. We ensure that dropout is only active during training. 


dropout1, dropout2 = 0.2, 0.5

def net(X):
    X = X.reshape(-1, num_inputs)
    H1 = npx.relu(np.dot(X, W1) + b1)
    # Use dropout only when training the model
    if autograd.is_training():
        # Add a dropout layer after the first fully connected layer
        H1 = dropout_layer(H1, dropout1)
    H2 = npx.relu(np.dot(H1, W2) + b2)
    if autograd.is_training():
        # Add a dropout layer after the second fully connected layer
        H2 = dropout_layer(H2, dropout2)
    return np.dot(H2, W3) + b3


Training and Testing


This is similar to the training and testing of MLPs described previously. 


num_epochs, lr, batch_size = 10, 0.5, 256
loss = gluon.loss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs,
              lambda batch_size: d2l.sgd(params, lr, batch_size))


[Figure: curves for train loss, train acc, and test acc]





4.6.5 Concise Implementation 


With high-level APIs, all we need to do is add a Dropout layer after each fully-connected layer, 
passing in the dropout probability as the only argument to its constructor. During training, the 
Dropout layer will randomly drop out outputs of the previous layer (or equivalently, the inputs to 
the subsequent layer) according to the specified dropout probability. When not in training mode,
the Dropout layer simply passes the data through.


net = nn.Sequential()
net.add(nn.Dense(256, activation="relu"),
        # Add a dropout layer after the first fully connected layer
        nn.Dropout(dropout1),
        nn.Dense(256, activation="relu"),
        # Add a dropout layer after the second fully connected layer
        nn.Dropout(dropout2),
        nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))


Next, we train and test the model. 


trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)


[Figure: curves for train loss, train acc, and test acc]





Summary 
• Beyond controlling the number of dimensions and the size of the weight vector, dropout is
  yet another tool to avoid overfitting. Often they are used jointly.
• Dropout replaces an activation $h$ with a random variable with expected value $h$.
• Dropout is only used during training.


Exercises 


1. What happens if you change the dropout probabilities for the first and second layers? In 
particular, what happens if you switch the ones for both layers? Design an experiment to 
answer these questions, describe your results quantitatively, and summarize the qualitative 
takeaways. 


2. Increase the number of epochs and compare the results obtained when using dropout with 
those when not using it. 


3. What is the variance of the activations in each hidden layer when dropout is and is not ap- 
plied? Draw a plot to show how this quantity evolves over time for both models. 


4. Why is dropout not typically used at test time?


5. Using the model in this section as an example, compare the effects of using dropout and 
weight decay. What happens when dropout and weight decay are used at the same time? 
Are the results additive? Are there diminished returns (or worse)? Do they cancel each other 
out? 


6. What happens if we apply dropout to the individual weights of the weight matrix rather than 
the activations? 


7. Invent another technique for injecting random noise at each layer that is different from the 
standard dropout technique. Can you develop a method that outperforms dropout on the 
Fashion-MNIST dataset (for a fixed architecture)? 


Discussions





7 https://discuss.d2l.ai/t/100


4.7 Forward Propagation, Backward Propagation, and Computational Graphs


So far, we have trained our models with minibatch stochastic gradient descent. However, when 
we implemented the algorithm, we only worried about the calculations involved in forward prop- 
agation through the model. When it came time to calculate the gradients, we just invoked the 
backpropagation function provided by the deep learning framework. 


The automatic calculation of gradients (automatic differentiation) profoundly simplifies the im- 
plementation of deep learning algorithms. Before automatic differentiation, even small changes 
to complicated models required recalculating complicated derivatives by hand. Surprisingly of- 
ten, academic papers had to allocate numerous pages to deriving update rules. While we must 
continue to rely on automatic differentiation so we can focus on the interesting parts, you ought 
to know how these gradients are calculated under the hood if you want to go beyond a shallow 
understanding of deep learning. 


In this section, we take a deep dive into the details of backward propagation (more commonly called 
backpropagation). To convey some insight for both the techniques and their implementations, we 
rely on some basic mathematics and computational graphs. To start, we focus our exposition on 
a one-hidden-layer MLP with weight decay (L2 regularization). 


4.7.1 Forward Propagation 


Forward propagation (or forward pass) refers to the calculation and storage of intermediate variables 
(including outputs) for a neural network in order from the input layer to the output layer. We now 
work step-by-step through the mechanics of a neural network with one hidden layer. This may 
seem tedious but in the eternal words of funk virtuoso James Brown, you must “pay the cost to be 
the boss”. 


For the sake of simplicity, let us assume that the input example is $\mathbf{x} \in \mathbb{R}^d$ and that our hidden layer
does not include a bias term. Here the intermediate variable is:


$$\mathbf{z} = \mathbf{W}^{(1)} \mathbf{x}, \tag{4.7.1}$$


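where $\mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ is the weight parameter of the hidden layer. After running the intermediate
variable $\mathbf{z} \in \mathbb{R}^h$ through the activation function $\phi$ we obtain our hidden activation vector of length
$h$,

$$\mathbf{h} = \phi(\mathbf{z}). \tag{4.7.2}$$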


The hidden variable $\mathbf{h}$ is also an intermediate variable. Assuming that the parameters of the output
layer only possess a weight of $\mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$, we can obtain an output layer variable with a vector
of length $q$:

$$\mathbf{o} = \mathbf{W}^{(2)} \mathbf{h}. \tag{4.7.3}$$


Assuming that the loss function is $l$ and the example label is $y$, we can then calculate the loss term
for a single data example,


$$L = l(\mathbf{o}, y). \tag{4.7.4}$$


According to the definition of $L_2$ regularization, given the hyperparameter $\lambda$, the regularization
term is


$$s = \frac{\lambda}{2}\left(\|\mathbf{W}^{(1)}\|_F^2 + \|\mathbf{W}^{(2)}\|_F^2\right), \tag{4.7.5}$$


where the Frobenius norm of the matrix is simply the L2 norm applied after flattening the matrix 
into a vector. Finally, the model’s regularized loss on a given data example is: 


$$J = L + s. \tag{4.7.6}$$


We refer to J as the objective function in the following discussion. 
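
The forward pass above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the book's implementation: we pick ReLU for the activation function and a squared error for the loss, neither of which the text fixes, and the helper names and the tiny weight matrices are hypothetical.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def frobenius_sq(W):
    """Squared Frobenius norm: the sum of squares of all entries."""
    return sum(wij ** 2 for row in W for wij in row)


def forward(x, y, W1, W2, lambd):
    z = matvec(W1, x)                                        # (4.7.1)
    h = [max(0.0, zi) for zi in z]                           # (4.7.2), ReLU assumed
    o = matvec(W2, h)                                        # (4.7.3)
    L = 0.5 * sum((oi - yi) ** 2 for oi, yi in zip(o, y))    # (4.7.4), squared loss assumed
    s = lambd / 2 * (frobenius_sq(W1) + frobenius_sq(W2))    # (4.7.5)
    J = L + s                                                # (4.7.6)
    return z, h, o, L, s, J


# Hypothetical sizes: d = 2 inputs, h = 2 hidden units, q = 1 output.
x, y = [1.0, 2.0], [1.0]
W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, 2.0]]
z, h, o, L, s, J = forward(x, y, W1, W2, lambd=0.1)
print(z, h, o, L, s, J)
```

Every value returned by `forward` is an intermediate variable in the sense used above, and each is needed again when gradients are computed.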


4.7.2 Computational Graph of Forward Propagation 


Plotting computational graphs helps us visualize the dependencies of operators and variables 
within the calculation. Fig. 4.7.1 contains the graph associated with the simple network described 
above, where squares denote variables and circles denote operators. The lower-left corner signi- 
fies the input and the upper-right corner is the output. Notice that the directions of the arrows 
(which illustrate data flow) are primarily rightward and upward. 





Fig. 4.7.1: Computational graph of forward propagation. 


4.7.3 Backpropagation 


Backpropagation refers to the method of calculating the gradient of neural network parameters. 
In short, the method traverses the network in reverse order, from the output to the input layer, 
according to the chain rule from calculus. The algorithm stores any intermediate variables (partial 
derivatives) required while calculating the gradient with respect to some parameters. Assume that 
we have functions Y = f(X) and Z = g(Y), in which the input and the output X, Y, Z are tensors of 
arbitrary shapes. By using the chain rule, we can compute the derivative of Z with respect to X via 


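$$\frac{\partial \mathsf{Z}}{\partial \mathsf{X}} = \text{prod}\left(\frac{\partial \mathsf{Z}}{\partial \mathsf{Y}}, \frac{\partial \mathsf{Y}}{\partial \mathsf{X}}\right). \tag{4.7.7}$$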


Here we use the prod operator to multiply its arguments after the necessary operations, such as 
transposition and swapping input positions, have been carried out. For vectors, this is straight- 
forward: it is simply matrix-matrix multiplication. For higher dimensional tensors, we use the 
appropriate counterpart. The operator prod hides all the notation overhead. 


Recall that the parameters of the simple network with one hidden layer, whose computational
graph is in Fig. 4.7.1, are $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$. The objective of backpropagation is to calculate the
gradients $\partial J/\partial \mathbf{W}^{(1)}$ and $\partial J/\partial \mathbf{W}^{(2)}$. To accomplish this, we apply the chain rule and calculate, in turn,
the gradient of each intermediate variable and parameter. The order of calculations is reversed
relative to those performed in forward propagation, since we need to start with the outcome of the
computational graph and work our way towards the parameters. The first step is to calculate the
gradients of the objective function $J = L + s$ with respect to the loss term $L$ and the regularization
term $s$.

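$$\frac{\partial J}{\partial L} = 1 \; \text{and} \; \frac{\partial J}{\partial s} = 1. \tag{4.7.8}$$

Next, we compute the gradient of the objective function with respect to the variable of the output layer
$\mathbf{o}$ according to the chain rule:

$$\frac{\partial J}{\partial \mathbf{o}} = \text{prod}\left(\frac{\partial J}{\partial L}, \frac{\partial L}{\partial \mathbf{o}}\right) = \frac{\partial L}{\partial \mathbf{o}} \in \mathbb{R}^q. \tag{4.7.9}$$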

Next, we calculate the gradients of the regularization term with respect to both parameters: 


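$$\frac{\partial s}{\partial \mathbf{W}^{(1)}} = \lambda \mathbf{W}^{(1)} \; \text{and} \; \frac{\partial s}{\partial \mathbf{W}^{(2)}} = \lambda \mathbf{W}^{(2)}. \tag{4.7.10}$$

Now we are able to calculate the gradient $\partial J/\partial \mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$ of the model parameters closest to
the output layer. Using the chain rule yields:

$$\frac{\partial J}{\partial \mathbf{W}^{(2)}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{W}^{(2)}}\right) + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(2)}}\right) = \frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}. \tag{4.7.11}$$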





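To obtain the gradient with respect to $\mathbf{W}^{(1)}$ we need to continue backpropagation along the output
layer to the hidden layer. The gradient with respect to the hidden layer's outputs $\partial J/\partial \mathbf{h} \in \mathbb{R}^h$ is
given by

$$\frac{\partial J}{\partial \mathbf{h}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{h}}\right) = {\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}. \tag{4.7.12}$$

Since the activation function $\phi$ applies elementwise, calculating the gradient $\partial J/\partial \mathbf{z} \in \mathbb{R}^h$ of the
intermediate variable $\mathbf{z}$ requires that we use the elementwise multiplication operator, which we
denote by $\odot$:

$$\frac{\partial J}{\partial \mathbf{z}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{h}}, \frac{\partial \mathbf{h}}{\partial \mathbf{z}}\right) = \frac{\partial J}{\partial \mathbf{h}} \odot \phi'\left(\mathbf{z}\right). \tag{4.7.13}$$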


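Finally, we can obtain the gradient $\partial J/\partial \mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ of the model parameters closest to the input
layer. According to the chain rule, we get

$$\frac{\partial J}{\partial \mathbf{W}^{(1)}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{z}}, \frac{\partial \mathbf{z}}{\partial \mathbf{W}^{(1)}}\right) + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(1)}}\right) = \frac{\partial J}{\partial \mathbf{z}} \mathbf{x}^\top + \lambda \mathbf{W}^{(1)}. \tag{4.7.14}$$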
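
The backward pass can be sketched in plain Python and checked against a finite-difference estimate. As before, this is an illustration under assumptions of ours: ReLU as the activation, a squared-error loss (so that the gradient of the loss with respect to the output is the output minus the label), and hypothetical helper names and weights.

```python
def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def objective(x, y, W1, W2, lambd):
    """Forward pass; returns J plus the intermediates that backprop reuses."""
    z = matvec(W1, x)
    h = [max(0.0, zi) for zi in z]  # ReLU assumed
    o = matvec(W2, h)
    L = 0.5 * sum((oi - yi) ** 2 for oi, yi in zip(o, y))  # squared loss assumed
    s = lambd / 2 * (sum(w ** 2 for r in W1 for w in r)
                     + sum(w ** 2 for r in W2 for w in r))
    return L + s, z, h, o


def backward(x, y, W1, W2, lambd):
    _, z, h, o = objective(x, y, W1, W2, lambd)
    dJ_do = [oi - yi for oi, yi in zip(o, y)]                    # (4.7.9)
    dJ_dW2 = [[g + lambd * w for g, w in zip(grow, wrow)]
              for grow, wrow in zip(outer(dJ_do, h), W2)]        # (4.7.11)
    dJ_dh = [sum(W2[k][j] * dJ_do[k] for k in range(len(W2)))
             for j in range(len(h))]                             # (4.7.12)
    dJ_dz = [g * (1.0 if zi > 0 else 0.0)
             for g, zi in zip(dJ_dh, z)]                         # (4.7.13)
    dJ_dW1 = [[g + lambd * w for g, w in zip(grow, wrow)]
              for grow, wrow in zip(outer(dJ_dz, x), W1)]        # (4.7.14)
    return dJ_dW1, dJ_dW2


def numerical_grad(x, y, W1, W2, lambd, which, i, j, eps=1e-6):
    """Central finite difference of J with respect to one weight entry."""
    W = W1 if which == 1 else W2
    old = W[i][j]
    W[i][j] = old + eps
    J_plus = objective(x, y, W1, W2, lambd)[0]
    W[i][j] = old - eps
    J_minus = objective(x, y, W1, W2, lambd)[0]
    W[i][j] = old  # restore the original entry
    return (J_plus - J_minus) / (2 * eps)


x, y, lambd = [1.0, 2.0], [1.0], 0.1
W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, 2.0]]
dW1, dW2 = backward(x, y, W1, W2, lambd)
print(dW1, dW2)
print(numerical_grad(x, y, W1, W2, lambd, which=1, i=1, j=0))
```

Note that `backward` reuses `z` and `h` from the forward pass, which is the dependency between the two passes discussed in the next subsection. The finite-difference check is only reliable away from the kink of ReLU at zero, which holds for the values chosen here.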








4.7.4 Training Neural Networks 


When training neural networks, forward and backward propagation depend on each other. In 
particular, for forward propagation, we traverse the computational graph in the direction of de- 
pendencies and compute all the variables on its path. These are then used for backpropagation 
where the compute order on the graph is reversed. 


Take the aforementioned simple network as an example to illustrate. On one hand, computing the 
regularization term (4.7.5) during forward propagation depends on the current values of model
parameters $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$. They are given by the optimization algorithm according to backpropagation
in the latest iteration. On the other hand, the gradient calculation for the parameter (4.7.11)
during backpropagation depends on the current value of the hidden variable h, which is given by 
forward propagation. 


Therefore when training neural networks, after model parameters are initialized, we alternate 
forward propagation with backpropagation, updating model parameters using gradients given by 
backpropagation. Note that backpropagation reuses the stored intermediate values from forward 
propagation to avoid duplicate calculations. One of the consequences is that we need to retain 
the intermediate values until backpropagation is complete. This is also one of the reasons why 
training requires significantly more memory than plain prediction. Besides, the size of such in- 
termediate values is roughly proportional to the number of network layers and the batch size. 
Thus, training deeper networks using larger batch sizes more easily leads to out of memory errors. 
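
A back-of-the-envelope calculation illustrates the scaling. The sketch below is a deliberately rough model of ours: it counts one stored value per unit per example, assumes 4-byte floats, and ignores parameters, gradients, and framework overhead. The function name and layer widths are hypothetical.

```python
def stored_activation_bytes(batch_size, layer_widths, bytes_per_value=4):
    """Bytes needed to keep one activation per unit per example (float32 assumed)."""
    return batch_size * sum(layer_widths) * bytes_per_value


# Hypothetical MLP: two hidden layers of 256 units and a 10-unit output layer.
widths = [256, 256, 10]
print(stored_activation_bytes(256, widths))      # baseline, batch of 256
print(stored_activation_bytes(512, widths))      # twice the batch size
print(stored_activation_bytes(256, widths * 2))  # twice as many layers
```

Doubling either the batch size or the number of layers doubles the estimate, whereas prediction can discard each layer's activations as soon as the next layer has been computed.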


Summary 


• Forward propagation sequentially calculates and stores intermediate variables within the
  computational graph defined by the neural network. It proceeds from the input to the output
  layer.
• Backpropagation sequentially calculates and stores the gradients of intermediate variables
  and parameters within the neural network in the reversed order.
• When training deep learning models, forward propagation and backpropagation are
  interdependent.
• Training requires significantly more memory than prediction.


Exercises 
1. Assume that the inputs $\mathbf{X}$ to some scalar function $f$ are $n \times m$ matrices. What is the
dimensionality of the gradient of $f$ with respect to $\mathbf{X}$?


2. Add a bias to the hidden layer of the model described in this section (you do not need to 
include bias in the regularization term). 


1. Draw the corresponding computational graph. 
2. Derive the forward and backward propagation equations. 


3. Compute the memory footprint for training and prediction in the model described in this 
section. 


4. Assume that you want to compute second derivatives. What happens to the computational 
graph? How long do you expect the calculation to take? 


5. Assume that the computational graph is too large for your GPU. 
1. Can you partition it over more than one GPU? 
2. What are the advantages and disadvantages over training on a smaller minibatch? 


Discussions*


* https://discuss.d2l.ai/t/102


4.8 Numerical Stability and Initialization


Thus far, every model that we have implemented required that we initialize its parameters
according to some pre-specified distribution. Until now, we took the initialization scheme for granted,
glossing over the details of how these choices are made. You might have even gotten the
impression that these choices are not especially important. On the contrary, the choice of initialization
scheme plays a significant role in neural network learning, and it can be crucial for maintaining
numerical stability. Moreover, these choices can be tied up in interesting ways with the choice of
the nonlinear activation function. Which function we choose and how we initialize parameters
can determine how quickly our optimization algorithm converges. Poor choices here can cause us
to encounter exploding or vanishing gradients while training. In this section, we delve into these
topics in greater detail and discuss some heuristics that you will find useful throughout
your career in deep learning.


4.8.1 Vanishing and Exploding Gradients 


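Consider a deep network with $L$ layers, input $\mathbf{x}$ and output $\mathbf{o}$. With each layer $l$ defined by a
transformation $f_l$ parameterized by weights $\mathbf{W}^{(l)}$, whose hidden variable is $\mathbf{h}^{(l)}$
(let $\mathbf{h}^{(0)} = \mathbf{x}$), our network can be expressed as:


$$\mathbf{h}^{(l)} = f_l (\mathbf{h}^{(l-1)}) \text{ and thus } \mathbf{o} = f_L \circ \ldots \circ f_1(\mathbf{x}). \tag{4.8.1}$$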


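If all the hidden variables and the input are vectors, we can write the gradient of $\mathbf{o}$ with respect to
any set of parameters $\mathbf{W}^{(l)}$ as follows:


$$\partial_{\mathbf{W}^{(l)}} \mathbf{o} = \underbrace{\partial_{\mathbf{h}^{(L-1)}} \mathbf{h}^{(L)}}_{\mathbf{M}^{(L)} \stackrel{\mathrm{def}}{=}} \cdot \ldots \cdot \underbrace{\partial_{\mathbf{h}^{(l)}} \mathbf{h}^{(l+1)}}_{\mathbf{M}^{(l+1)} \stackrel{\mathrm{def}}{=}} \underbrace{\partial_{\mathbf{W}^{(l)}} \mathbf{h}^{(l)}}_{\mathbf{v}^{(l)} \stackrel{\mathrm{def}}{=}}. \tag{4.8.2}$$


In other words, this gradient is the product of $L-l$ matrices $\mathbf{M}^{(L)} \cdot \ldots \cdot \mathbf{M}^{(l+1)}$ and the
gradient vector $\mathbf{v}^{(l)}$. Thus we are susceptible to the same problems of numerical underflow that often crop up
when multiplying together too many probabilities. When dealing with probabilities, a common
trick is to switch into log-space, i.e., shifting pressure from the mantissa to the exponent of the
numerical representation. Unfortunately, our problem above is more serious: initially the
matrices $\mathbf{M}^{(l)}$ may have a wide variety of eigenvalues. They might be small or large, and their product
might be very large or very small.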


The risks posed by unstable gradients go beyond numerical representation. Gradients of unpre- 
dictable magnitude also threaten the stability of our optimization algorithms. We may be fac- 
ing parameter updates that are either (i) excessively large, destroying our model (the exploding 
gradient problem); or (ii) excessively small (the vanishing gradient problem), rendering learning 
impossible as parameters hardly move on each update. 
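
To get a feeling for how sensitive such a product is to the scale of its factors, here is a small NumPy sketch. It is an illustration under stated assumptions rather than a gradient taken from a real network: the random matrices are stand-ins for the layer Jacobians, and the function name `jacobian_product_norm` and its parameters are hypothetical.

```python
import numpy as np

def jacobian_product_norm(scale, num_layers=50, width=4, seed=0):
    """Frobenius norm of a product of `num_layers` random stand-in Jacobians."""
    rng = np.random.default_rng(seed)
    prod = np.eye(width)
    for _ in range(num_layers):
        # One stand-in factor with entry variance scale**2 / width.
        M = scale * rng.normal(size=(width, width)) / np.sqrt(width)
        prod = M @ prod
    return np.linalg.norm(prod)

for scale in (0.5, 1.0, 2.0):
    norm = jacobian_product_norm(scale)
    print(f'scale={scale}: norm of the product = {norm:.3e}')
```

Because the same seed is used for every scale, the three products differ only by the factor `scale ** num_layers`, so changing the scale by a factor of 4 moves the norm by roughly 30 orders of magnitude over 50 layers.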







Vanishing Gradients 


One frequent culprit causing the vanishing gradient problem is the choice of the activation
function $\sigma$ that is appended following each layer's linear operations. Historically, the sigmoid function
$1/(1 + \exp(-x))$ (introduced in Section 4.1) was popular because it resembles a thresholding
function. Since early artificial neural networks were inspired by biological neural networks, the idea
of neurons that fire either fully or not at all (like biological neurons) seemed appealing. Let us take
a closer look at the sigmoid to see why it can cause vanishing gradients.


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, np, npx
npx.set_np()

# Evaluate the sigmoid on a grid and record the computation for autograd
x = np.arange(-8.0, 8.0, 0.1)
x.attach_grad()
with autograd.record():
    y = npx.sigmoid(x)
y.backward()  # Fills x.grad with the elementwise derivative of the sigmoid

d2l.plot(x, [y, x.grad], legend=['sigmoid', 'gradient'], figsize=(4.5, 2.5))


[Plot: the sigmoid function (solid line) and its gradient (dashed line).]


As you can see, the sigmoid's gradient vanishes both when its inputs are large and when they are
small. Moreover, when backpropagating through many layers, unless we are in the Goldilocks
zone, where the inputs to many of the sigmoids are close to zero, the gradients of the overall
product may vanish. When our network boasts many layers, unless we are careful, the gradient
will likely be cut off at some layer. Indeed, this problem used to plague deep network training.
Consequently, ReLUs, which are more stable (but less neurally plausible), have emerged as the
default choice for practitioners.


Exploding Gradients


The opposite problem, when gradients explode, can be similarly vexing. To illustrate this a bit
better, we draw 100 Gaussian random matrices and multiply them with some initial matrix. For
the scale that we picked (the choice of the variance $\sigma^2 = 1$), the matrix product explodes. When
this happens due to the initialization of a deep network, we have no chance of getting a gradient
descent optimizer to converge.


# Start from one random 4x4 matrix and keep multiplying by fresh ones
M = np.random.normal(size=(4, 4))
print('a single matrix', M)
for i in range(100):
    M = np.dot(M, np.random.normal(size=(4, 4)))

print('after multiplying 100 matrices', M)


a single matrix [[ 2.2122064   1.1630787   0.7740038   0.4838046 ]
 [ 1.0434405   0.29956347  1.1839255   0.15302546]
 [ 1.8917114  -1.1688148  -1.2347414   1.5580711 ]
 [ ...        -0.5459446  -0.45138445 -2.3556297 ]]
after multiplying 100 matrices [[ 3.4459714e+23 -7.8040680e+23  5.9973287e+23  4.5229990e+23]
 [ 2.5275089e+23 -5.7240326e+23  4.3988473e+23  3.3174740e+23]
 [ 1.3731286e+24 -3.1097155e+24  2.3897773e+24  1.8022959e+24]
 [-4.4951040e+23  1.0180033e+24 -7.8232281e+23 -5.9000354e+23]]


Breaking the Symmetry 


Another problem in neural network design is the symmetry inherent in their parametrization.
Assume that we have a simple MLP with one hidden layer and two units. In this case, we could
permute the weights $\mathbf{W}^{(1)}$ of the first layer and likewise permute the weights of the output layer
to obtain the same function. There is nothing special differentiating the first hidden unit vs. the
second hidden unit. In other words, we have permutation symmetry among the hidden units of
each layer.


This is more than just a theoretical nuisance. Consider the aforementioned one-hidden-layer MLP
with two hidden units. For illustration, suppose that the output layer transforms the two hidden
units into only one output unit. Imagine what would happen if we initialized all of the parameters
of the hidden layer as $\mathbf{W}^{(1)} = c$ for some constant $c$. In this case, during forward propagation
both hidden units take the same inputs and parameters, producing the same activation, which
is fed to the output unit. During backpropagation, differentiating the output unit with respect to
parameters $\mathbf{W}^{(1)}$ gives a gradient whose elements all take the same value. Thus, after
gradient-based iteration (e.g., minibatch stochastic gradient descent), all the elements of $\mathbf{W}^{(1)}$ still take the
same value. Such iterations would never break the symmetry on their own and we might never be
able to realize the network's expressive power. The hidden layer would behave as if it had only a
single unit. Note that while minibatch stochastic gradient descent would not break this symmetry,
dropout regularization would!
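
The following NumPy sketch illustrates the effect on a tiny MLP with three inputs, two hidden units, and one output. It rests on a few assumptions that are not part of the discussion above: a tanh activation, a squared loss, full-batch gradient descent, and output-layer weights that are also set to a constant so that the two hidden units are exactly interchangeable. The helper name `train` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))   # 32 examples with 3 features
y = rng.normal(size=(32, 1))

def train(W1, W2, steps=50, lr=0.1):
    """Full-batch gradient descent on the squared loss of a 3-2-1 tanh MLP."""
    W1, W2 = W1.copy(), W2.copy()
    for _ in range(steps):
        H = np.tanh(X @ W1)               # hidden activations, shape (32, 2)
        err = H @ W2 - y                  # residuals, shape (32, 1)
        dW2 = H.T @ err / len(X)
        dH = (err @ W2.T) * (1 - H ** 2)  # backpropagate through tanh
        dW1 = X.T @ dH / len(X)
        W1 -= lr * dW1
        W2 -= lr * dW2
    return W1, W2

# Constant initialization: each column of W1 holds the weights of one hidden unit.
W1_const, _ = train(np.full((3, 2), 0.5), np.full((2, 1), 0.5))
# Random initialization for comparison.
W1_rand, _ = train(rng.normal(scale=0.5, size=(3, 2)),
                   rng.normal(scale=0.5, size=(2, 1)))

print('constant init, units identical:', np.allclose(W1_const[:, 0], W1_const[:, 1]))
print('random init, units identical:  ', np.allclose(W1_rand[:, 0], W1_rand[:, 1]))
```

With the constant initialization the two columns of the hidden-layer weight matrix receive the same gradient at every step, so they remain copies of each other, whereas the randomly initialized units are free to specialize.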


4.8.2 Parameter Initialization


One way of addressing—or at least mitigating—the issues raised above is through careful initial- 
ization. Additional care during optimization and suitable regularization can further enhance sta- 
bility. 


Default Initialization 


In the previous sections, e.g., in Section 3.3, we used a normal distribution to initialize the values 
of our weights. If we do not specify the initialization method, the framework will use a default 
random initialization method, which often works well in practice for moderate problem sizes. 


Xavier Initialization 


Let us look at the scale distribution of an output (e.g., a hidden variable) $o_i$ for some
fully-connected layer without nonlinearities. With $n_\mathrm{in}$ inputs $x_j$ and their associated weights $w_{ij}$ for this
layer, an output is given by


$$o_{i} = \sum_{j=1}^{n_\mathrm{in}} w_{ij} x_j. \tag{4.8.3}$$


The weights $w_{ij}$ are all drawn independently from the same distribution. Furthermore, let us
assume that this distribution has zero mean and variance $\sigma^2$. Note that this does not mean that
the distribution has to be Gaussian, just that the mean and variance need to exist. For now, let
us assume that the inputs to the layer $x_j$ also have zero mean and variance $\gamma^2$ and that they are
independent of $w_{ij}$ and independent of each other. In this case, we can compute the mean and
variance of $o_i$ as follows:


$$
\begin{aligned}
    E[o_i] & = \sum_{j=1}^{n_\mathrm{in}} E[w_{ij} x_j] \\
           & = \sum_{j=1}^{n_\mathrm{in}} E[w_{ij}] E[x_j] \\
           & = 0, \\
    \mathrm{Var}[o_i] & = E[o_i^2] - (E[o_i])^2 \\
           & = \sum_{j=1}^{n_\mathrm{in}} E[w^2_{ij} x^2_j] - 0 \\
           & = \sum_{j=1}^{n_\mathrm{in}} E[w^2_{ij}] E[x^2_j] \\
           & = n_\mathrm{in} \sigma^2 \gamma^2.
\end{aligned}
\tag{4.8.4}
$$
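
The result in (4.8.4) can be checked empirically. The sketch below is a Monte Carlo illustration in NumPy, with all concrete values ($n_\mathrm{in} = 128$, $\sigma = 0.1$, $\gamma = 2$, and the number of samples) chosen arbitrarily for this example. The weights are deliberately drawn from a uniform rather than a Gaussian distribution, since only the mean and the variance should matter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, sigma, gamma = 128, 0.1, 2.0
num_samples = 10_000

# Uniform weights on [-a, a] have variance a**2 / 3, so a = sigma * sqrt(3).
a = sigma * np.sqrt(3.0)
w = rng.uniform(-a, a, size=(num_samples, n_in))
x = rng.normal(0.0, gamma, size=(num_samples, n_in))

o = np.sum(w * x, axis=1)                  # one independent draw of o_i per row
predicted = n_in * sigma ** 2 * gamma ** 2

print('empirical mean:    ', o.mean())
print('empirical variance:', o.var())
print('predicted variance:', predicted)
```

The empirical variance should land close to the predicted value, up to Monte Carlo noise, and doubling `n_in` should double both.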


One way to keep the variance fixed is to set $n_\mathrm{in} \sigma^2 = 1$. Now consider backpropagation. There
we face a similar problem, albeit with gradients being propagated from the layers closer to the
output. Using the same reasoning as for forward propagation, we see that the gradients' variance
can blow up unless $n_\mathrm{out} \sigma^2 = 1$, where $n_\mathrm{out}$ is the number of outputs of this layer. This leaves us in
a dilemma: we cannot possibly satisfy both conditions simultaneously. Instead, we simply try to
satisfy:


$$\frac{1}{2} (n_\mathrm{in} + n_\mathrm{out}) \sigma^2 = 1 \text{ or equivalently } \sigma = \sqrt{\frac{2}{n_\mathrm{in} + n_\mathrm{out}}}. \tag{4.8.5}$$


This is the reasoning underlying the now-standard and practically beneficial Xavier initialization,
named after the first author of its creators (Glorot & Bengio, 2010). Typically, the Xavier
initialization samples weights from a Gaussian distribution with zero mean and variance
$\sigma^2 = \frac{2}{n_\mathrm{in} + n_\mathrm{out}}$. We
can also adapt Xavier's intuition to choose the variance when sampling weights from a uniform
distribution. Note that the uniform distribution $U(-a, a)$ has variance $\frac{a^2}{3}$. Plugging $\frac{a^2}{3}$ into our
condition on $\sigma^2$ yields the suggestion to initialize according to


$$U\left(-\sqrt{\frac{6}{n_\mathrm{in} + n_\mathrm{out}}}, \sqrt{\frac{6}{n_\mathrm{in} + n_\mathrm{out}}}\right). \tag{4.8.6}$$


Though the assumption for nonexistence of nonlinearities in the above mathematical reasoning
can be easily violated in neural networks, the Xavier initialization method turns out to work well
in practice.
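
As a sketch of what (4.8.5) and (4.8.6) amount to in code, the two helper functions below sample a weight matrix in plain NumPy. The names `xavier_normal` and `xavier_uniform` and the layer sizes are hypothetical choices for this illustration, not the API of any particular framework.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Gaussian weights with variance 2 / (n_in + n_out), as in (4.8.5)."""
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, rng):
    """Uniform weights on [-a, a] with a = sqrt(6 / (n_in + n_out)), as in (4.8.6)."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 300, 500
target_var = 2.0 / (n_in + n_out)

W_n = xavier_normal(n_in, n_out, rng)
W_u = xavier_uniform(n_in, n_out, rng)
print('target variance: ', target_var)
print('Gaussian variant:', W_n.var())
print('uniform variant: ', W_u.var())
```

Both variants should report an empirical variance close to the target, which shows that the bound in (4.8.6) is simply the uniform distribution matched to the variance in (4.8.5).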











Beyond 


The reasoning above barely scratches the surface of modern approaches to parameter initializa- 
tion. A deep learning framework often implements over a dozen different heuristics. Moreover, 
parameter initialization continues to be a hot area of fundamental research in deep learning. 
Among these are heuristics specialized for tied (shared) parameters, super-resolution, sequence 
models, and other situations. For instance, Xiao et al. demonstrated the possibility of training 
10000-layer neural networks without architectural tricks by using a carefully-designed initializa- 
tion method (Xiao et al., 2018). 


If the topic interests you, we suggest a deep dive into this module's offerings, reading the papers
that proposed and analyzed each heuristic, and then exploring the latest publications on the topic.
Perhaps you will stumble across or even invent a clever idea and contribute an implementation to
deep learning frameworks.


Summary 


Vanishing and exploding gradients are common issues in deep networks. Great care in pa- 
rameter initialization is required to ensure that gradients and parameters remain well con- 
trolled. 


Initialization heuristics are needed to ensure that the initial gradients are neither too large 
nor too small. 


ReLU activation functions mitigate the vanishing gradient problem. This can accelerate con- 
vergence. 


Random initialization is key to ensure that symmetry is broken before optimization. 


Xavier initialization suggests that, for each layer, the variance of any output should not be affected by
the number of inputs, and the variance of any gradient should not be affected by the number of outputs.


Exercises


1. Can you design other cases where a neural network might exhibit symmetry requiring break- 
ing besides the permutation symmetry in an MLP’s layers? 


2. Can we initialize all weight parameters in linear regression or in softmax regression to the 
same value? 


3. Look up analytic bounds on the eigenvalues of the product of two matrices. What does this 
tell you about ensuring that gradients are well conditioned? 


4. If we know that some terms diverge, can we fix this after the fact? Look at the paper on 
layer-wise adaptive rate scaling for inspiration (You et al., 2017). 


Discussions*


4.9 Environment and Distribution Shift 


In the previous sections, we worked through a number of hands-on applications of machine learn- 
ing, fitting models to a variety of datasets. And yet, we never stopped to contemplate either where 
data come from in the first place or what we plan to ultimately do with the outputs from our mod- 
els. Too often, machine learning developers in possession of data rush to develop models without 
pausing to consider these fundamental issues. 


Many failed machine learning deployments can be traced back to this pattern. Sometimes mod- 
els appear to perform marvelously as measured by test set accuracy but fail catastrophically in 
deployment when the distribution of data suddenly shifts. More insidiously, sometimes the very 
deployment of a model can be the catalyst that perturbs the data distribution. Say, for example, 
that we trained a model to predict who will repay vs. default on a loan, finding that an applicant's 
choice of footwear was associated with the risk of default (Oxfords indicate repayment, sneakers 
indicate default). We might be inclined to thereafter grant loans to all applicants wearing Oxfords 
and to deny all applicants wearing sneakers. 


In this case, our ill-considered leap from pattern recognition to decision-making and our failure 
to critically consider the environment might have disastrous consequences. For starters, as soon 
as we began making decisions based on footwear, customers would catch on and change their 
behavior. Before long, all applicants would be wearing Oxfords, without any coinciding improve- 
ment in credit-worthiness. Take a minute to digest this because similar issues abound in many 
applications of machine learning: by introducing our model-based decisions to the environment, 
we might break the model. 


While we cannot possibly give these topics a complete treatment in one section, we aim here to 
expose some common concerns, and to stimulate the critical thinking required to detect these 
situations early, mitigate damage, and use machine learning responsibly. Some of the solutions 
are simple (ask for the “right” data), some are technically difficult (implement a reinforcement 
learning system), and others require that we step outside the realm of statistical prediction al- 
together and grapple with difficult philosophical questions concerning the ethical application of 
algorithms. 





* https://discuss.d2l.ai/t/103


4.9.1 Types of Distribution Shift


To begin, we stick with the passive prediction setting, considering the various ways that data
distributions might shift and what might be done to salvage model performance. In one classic setup,
we assume that our training data were sampled from some distribution $p_S(\mathbf{x}, y)$ but that our test
data will consist of unlabeled examples drawn from some different distribution $p_T(\mathbf{x}, y)$. Already,
we must confront a sobering reality. Absent any assumptions on how $p_S$ and $p_T$ relate to each
other, learning a robust classifier is impossible.


Consider a binary classification problem, where we wish to distinguish between dogs and cats.
If the distribution can shift in arbitrary ways, then our setup permits the pathological case in
which the distribution over inputs remains constant: $p_S(\mathbf{x}) = p_T(\mathbf{x})$, but the labels are all flipped:
$p_S(y \mid \mathbf{x}) = 1 - p_T(y \mid \mathbf{x})$. In other words, if God can suddenly decide that in the future all “cats”
are now dogs and what we previously called “dogs” are now cats, without any change in the
distribution of inputs $p(\mathbf{x})$, then we cannot possibly distinguish this setting from one in which the
distribution did not change at all.


Fortunately, under some restricted assumptions on the ways our data might change in the future, 
principled algorithms can detect shift and sometimes even adapt on the fly, improving on the 
accuracy of the original classifier. 


Covariate Shift 


Among categories of distribution shift, covariate shift may be the most widely studied. Here, we 
assume that while the distribution of inputs may change over time, the labeling function, i.e., the 
conditional distribution P(y | x) does not change. Statisticians call this covariate shift because 
the problem arises due to a shift in the distribution of the covariates (features). While we can 
sometimes reason about distribution shift without invoking causality, we note that covariate shift 
is the natural assumption to invoke in settings where we believe that x causes y. 


Consider the challenge of distinguishing cats and dogs. Our training data might consist of images 
of the kind in Fig. 4.9.1. 


Fig. 4.9.1: Training data for distinguishing cats and dogs.


At test time we are asked to classify the images in Fig. 4.9.2.


Fig. 4.9.2: Test data for distinguishing cats and dogs.





The training set consists of photos, while the test set contains only cartoons. Training on a dataset 
with substantially different characteristics from the test set can spell trouble absent a coherent 
plan for how to adapt to the new domain. 


Label Shift 


Label shift describes the converse problem. Here, we assume that the label marginal P(y) can 
change but the class-conditional distribution P(x | y) remains fixed across domains. Label shift is 
a reasonable assumption to make when we believe that y causes x. For example, we may want to
predict diagnoses given their symptoms (or other manifestations), even as the relative prevalence
of diagnoses is changing over time. Label shift is the appropriate assumption here because
diseases cause symptoms. In some degenerate cases the label shift and covariate shift assumptions
can hold simultaneously. For example, when the label is deterministic, the covariate shift assump- 
tion will be satisfied, even when y causes x. Interestingly, in these cases, it is often advantageous 
to work with methods that flow from the label shift assumption. That is because these methods 
tend to involve manipulating objects that look like labels (often low-dimensional), as opposed to 
objects that look like inputs, which tend to be high-dimensional in deep learning. 


Concept Shift 


We may also encounter the related problem of concept shift, which arises when the very definitions
of labels can change. This sounds weird: a cat is a cat, no? However, other categories are subject to
changes in usage over time. Diagnostic criteria for mental illness, what passes for fashionable, and
job titles are all subject to considerable amounts of concept shift. It turns out that if we navigate
around the United States, shifting the source of our data by geography, we will find considerable
concept shift regarding the distribution of names for soft drinks as shown in Fig. 4.9.3.


Fig. 4.9.3: Concept shift on soft drink names in the United States.


If we were to build a machine translation system, the distribution P(y | x) might be different de- 
pending on our location. This problem can be tricky to spot. We might hope to exploit knowledge 
that shift only takes place gradually either in a temporal or geographic sense. 


4.9.2 Examples of Distribution Shift 


Before delving into formalism and algorithms, we can discuss some concrete situations where 
covariate or concept shift might not be obvious. 


Medical Diagnostics 


Imagine that you want to design an algorithm to detect cancer. You collect data from healthy and 
sick people and you train your algorithm. It works fine, giving you high accuracy and you conclude 
that you are ready for a successful career in medical diagnostics. Not so fast. 


The distributions that gave rise to the training data and those you will encounter in the wild might 
differ considerably. This happened to an unfortunate startup that some of us (authors) worked 
with years ago. They were developing a blood test for a disease that predominantly affects older 
men and hoped to study it using blood samples that they had collected from patients. However, it 
is considerably more difficult to obtain blood samples from healthy men than sick patients already 
in the system. To compensate, the startup solicited blood donations from students on a university 
campus to serve as healthy controls in developing their test. Then they asked whether we could 
help them to build a classifier for detecting the disease. 


As we explained to them, it would indeed be easy to distinguish between the healthy and sick 
cohorts with near-perfect accuracy. However, that is because the test subjects differed in age, 
hormone levels, physical activity, diet, alcohol consumption, and many more factors unrelated 
to the disease. This was unlikely to be the case with real patients. Due to their sampling proce- 
dure, we could expect to encounter extreme covariate shift. Moreover, this case was unlikely to 
be correctable via conventional methods. In short, they wasted a significant sum of money.


Self-Driving Cars


Say a company wanted to leverage machine learning for developing self-driving cars. One key 
component here is a roadside detector. Since real annotated data are expensive to get, they had 
the (smart and questionable) idea to use synthetic data from a game rendering engine as additional 
training data. This worked really well on “test data” drawn from the rendering engine. Alas, inside 
a real car it was a disaster. As it turned out, the roadside had been rendered with a very simplis- 
tic texture. More importantly, all the roadside had been rendered with the same texture and the 
roadside detector learned about this “feature” very quickly. 


A similar thing happened to the US Army when they first tried to detect tanks in the forest. They 
took aerial photographs of the forest without tanks, then drove the tanks into the forest and took 
another set of pictures. The classifier appeared to work perfectly. Unfortunately, it had merely
learned how to distinguish trees with shadows from trees without shadows: the first set of pictures
was taken in the early morning, the second set at noon.


Nonstationary Distributions 


A much more subtle situation arises when the distribution changes slowly (also known as a
nonstationary distribution) and the model is not updated adequately. Below are some typical cases.


• We train a computational advertising model and then fail to update it frequently (e.g., we
  forget to incorporate that an obscure new device called an iPad was just launched).


• We build a spam filter. It works well at detecting all spam that we have seen so far. But then
  the spammers wisen up and craft new messages that look unlike anything we have seen
  before.


• We build a product recommendation system. It works throughout the winter but then
  continues to recommend Santa hats long after Christmas.


More Anecdotes 


• We build a face detector. It works well on all benchmarks. Unfortunately it fails on test
  data: the offending examples are close-ups where the face fills the entire image (no such
  data were in the training set).


• We build a Web search engine for the US market and want to deploy it in the UK.


• We train an image classifier by compiling a large dataset where each among a large set of
  classes is equally represented in the dataset, say 1000 categories, represented by 1000 images
  each. Then we deploy the system in the real world, where the actual label distribution of
  photographs is decidedly non-uniform.


4.9.3 Correction of Distribution Shift


As we have discussed, there are many cases where training and test distributions P(x, y) are dif- 
ferent. In some cases, we get lucky and the models work despite covariate, label, or concept shift. 
In other cases, we can do better by employing principled strategies to cope with the shift. The re- 
mainder of this section grows considerably more technical. The impatient reader could continue 
on to the next section as this material is not prerequisite to subsequent concepts. 


Empirical Risk and True Risk 


Let us first reflect on what exactly is happening during model training: we iterate over features
and associated labels of training data $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ and update the parameters of a model
$f$ after every minibatch. For simplicity we do not consider regularization, so we largely minimize
the loss on the training data:


$$\mathop{\mathrm{minimize}}_f \frac{1}{n} \sum_{i=1}^n l(f(\mathbf{x}_i), y_i), \tag{4.9.1}$$


where $l$ is the loss function measuring “how bad” the prediction $f(\mathbf{x}_i)$ is given the associated
label $y_i$. Statisticians call the term in (4.9.1) empirical risk. Empirical risk is an average loss over
the training data to approximate the true risk, which is the expectation of the loss over the entire
population of data drawn from their true distribution $p(\mathbf{x}, y)$:


$$E_{p(\mathbf{x}, y)} [l(f(\mathbf{x}), y)] = \int\int l(f(\mathbf{x}), y) p(\mathbf{x}, y) \;d\mathbf{x}dy. \tag{4.9.2}$$


However, in practice we typically cannot obtain the entire population of data. Thus, empirical risk 
minimization, which is minimizing empirical risk in (4.9.1), is a practical strategy for machine 
learning, with the hope to approximate minimizing true risk. 


Covariate Shift Correction 


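Assume that we want to estimate some dependency $P(y \mid \mathbf{x})$ for which we have labeled data $(\mathbf{x}_i, y_i)$.
Unfortunately, the observations $\mathbf{x}_i$ are drawn from some source distribution $q(\mathbf{x})$ rather than the
target distribution $p(\mathbf{x})$. Fortunately, the dependency assumption means that the conditional
distribution does not change: $p(y \mid \mathbf{x}) = q(y \mid \mathbf{x})$. If the source distribution $q(\mathbf{x})$ is “wrong”, we can
correct for that by using the following simple identity in true risk:


$$\int\int l(f(\mathbf{x}), y) p(y \mid \mathbf{x}) p(\mathbf{x}) \;d\mathbf{x}dy = \int\int l(f(\mathbf{x}), y) q(y \mid \mathbf{x}) q(\mathbf{x}) \frac{p(\mathbf{x})}{q(\mathbf{x})} \;d\mathbf{x}dy. \tag{4.9.3}$$


In other words, we need to reweigh each data example by the ratio of the probability that it would
have been drawn from the correct distribution to that from the wrong one:


$$\beta_i \stackrel{\mathrm{def}}{=} \frac{p(\mathbf{x}_i)}{q(\mathbf{x}_i)}. \tag{4.9.4}$$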
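
To see the identity (4.9.3) at work, consider a toy case where both densities are known exactly, which is of course not the situation in practice. Everything concrete in this NumPy sketch is an assumption made for illustration: the source $q = N(0, 2^2)$, the target $p = N(2, 1)$, the per-example loss $x^2$ (e.g., a predictor that always outputs 0 for the label $y = x$), and the helper name `gaussian_pdf`.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 2.0, size=n)     # samples from the source q = N(0, 2^2)
loss = x ** 2                        # stand-in for l(f(x_i), y_i)

# Importance weights beta_i = p(x_i) / q(x_i) as in (4.9.4), with p = N(2, 1).
beta = gaussian_pdf(x, 2.0, 1.0) / gaussian_pdf(x, 0.0, 2.0)

source_risk = loss.mean()                # estimates the risk under q
reweighted_risk = (beta * loss).mean()   # estimates the risk under p
target_risk = 2.0 ** 2 + 1.0 ** 2        # exact E[x^2] under p = N(2, 1)

print('risk under source (unweighted):', source_risk)
print('reweighted estimate:           ', reweighted_risk)
print('exact risk under target:       ', target_risk)
```

The source here is chosen wider than the target so that the weights stay well behaved. If the source had lighter tails than the target, a few examples would receive enormous weights and the estimate would become unreliable.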





Plugging in the weight $\beta_i$ for each data example $(\mathbf{x}_i, y_i)$ we can train our model using weighted
empirical risk minimization:


$$\mathop{\mathrm{minimize}}_f \frac{1}{n} \sum_{i=1}^n \beta_i l(f(\mathbf{x}_i), y_i). \tag{4.9.5}$$


Alas, we do not know that ratio, so before we can do anything useful we need to estimate it. Many
methods are available, including some fancy operator-theoretic approaches that attempt to
recalibrate the expectation operator directly using a minimum-norm or a maximum entropy principle.
Note that for any such approach, we need samples drawn from both distributions: the “true” $p$,
e.g., by access to test data, and the one used for generating the training set $q$ (the latter is trivially
available). Note however, that we only need features $\mathbf{x} \sim p(\mathbf{x})$; we do not need to access labels
$y \sim p(y)$.


In this case, there exists a very effective approach that will give almost as good results as the orig- 
inal: logistic regression, which is a special case of softmax regression (see Section 3.4) for binary 
classification. This is all that is needed to compute estimated probability ratios. We learn a classi- 
fier to distinguish between data drawn from p(x) and data drawn from q(x). If it is impossible to 
distinguish between the two distributions then it means that the associated instances are equally 
likely to come from either one of the two distributions. On the other hand, any instances that can 
be well discriminated should be significantly overweighted or underweighted accordingly. 


For simplicity's sake assume that we have an equal number of instances from both distributions
$p(\mathbf{x})$ and $q(\mathbf{x})$, respectively. Now denote by $z$ labels that are $1$ for data drawn from $p$ and $-1$ for
data drawn from $q$. Then the probability in a mixed dataset is given by


$$P(z=1 \mid \mathbf{x}) = \frac{p(\mathbf{x})}{p(\mathbf{x})+q(\mathbf{x})} \text{ and hence } \frac{P(z=1 \mid \mathbf{x})}{P(z=-1 \mid \mathbf{x})} = \frac{p(\mathbf{x})}{q(\mathbf{x})}. \tag{4.9.6}$$


Thus, if we use a logistic regression approach, where $P(z=1 \mid \mathbf{x}) = \frac{1}{1+\exp(-h(\mathbf{x}))}$ ($h$ is a
parameterized function), it follows that


$$\beta_i = \frac{1/(1 + \exp(-h(\mathbf{x}_i)))}{\exp(-h(\mathbf{x}_i))/(1 + \exp(-h(\mathbf{x}_i)))} = \exp(h(\mathbf{x}_i)). \tag{4.9.7}$$


As a result, we need to solve two problems: first, one to distinguish between data drawn from both
distributions, and then a weighted empirical risk minimization problem in (4.9.5) where we weigh
terms by $\beta_i$.


Now we are ready to describe a correction algorithm. Suppose that we have a training set
$\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ and an unlabeled test set $\{\mathbf{u}_1, \ldots, \mathbf{u}_m\}$. For covariate shift, we assume that
$\mathbf{x}_i$ for all $1 \leq i \leq n$ are drawn from some source distribution and $\mathbf{u}_i$ for all $1 \leq i \leq m$ are drawn
from the target distribution. Here is a prototypical algorithm for correcting covariate shift:


1. Generate a binary-classification training set: $\{(\mathbf{x}_1, -1), \ldots, (\mathbf{x}_n, -1), (\mathbf{u}_1, 1), \ldots, (\mathbf{u}_m, 1)\}$.
2. Train a binary classifier using logistic regression to get function $h$.
3. Weigh training data using $\beta_i = \exp(h(\mathbf{x}_i))$ or better $\beta_i = \min(\exp(h(\mathbf{x}_i)), c)$ for some
   constant $c$.
4. Use weights $\beta_i$ for training on $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ in (4.9.5).
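
The four steps above can be sketched end to end on synthetic data. The setup is entirely hypothetical and chosen so that the answer is known: the source inputs follow $N(0, 1)$, the target inputs follow $N(1, 1)$, the labels obey $y = x^2$ in both domains, and the model is a straight line fitted by weighted least squares. The logistic regression is trained by plain gradient descent in NumPy, and `fit_line` is a helper name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = m = 2000
x = rng.normal(0.0, 1.0, size=n)   # labeled training inputs from the source q
u = rng.normal(1.0, 1.0, size=m)   # unlabeled test inputs from the target p
y = x ** 2                         # the labeling rule is the same in both domains

# Step 1: binary dataset with z = -1 for source and z = +1 for target examples.
feats = np.concatenate([x, u])
z = np.concatenate([-np.ones(n), np.ones(m)])

# Step 2: logistic regression h(x) = w * x + b, trained by gradient descent.
w, b = 0.0, 0.0
for _ in range(1000):
    margin = z * (w * feats + b)
    g = -z / (1.0 + np.exp(margin))   # derivative of log(1 + exp(-z h)) w.r.t. h
    w -= 0.5 * np.mean(g * feats)
    b -= 0.5 * np.mean(g)

# Step 3: clipped importance weights for the training examples.
c = 10.0
beta = np.minimum(np.exp(w * x + b), c)

# Step 4: weighted empirical risk minimization, here weighted least squares.
def fit_line(inputs, targets, weights):
    """Return (slope, intercept) minimizing sum_i weights_i * (a x_i + d - y_i)^2."""
    A = np.stack([inputs, np.ones_like(inputs)], axis=1)
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], targets * sw, rcond=None)
    return coef

slope_plain, intercept_plain = fit_line(x, y, np.ones(n))
slope_weighted, intercept_weighted = fit_line(x, y, beta)
print('learned h: w =', w, ' b =', b)
print('unweighted fit slope:', slope_plain)
print('reweighted fit slope:', slope_weighted)
```

For these two Gaussians the exact log-ratio is $h(x) = x - 0.5$, so the learned `w` and `b` can be compared against 1 and -0.5. The best line under the target distribution has slope 2, while the best line under the source has slope 0, and the reweighted fit should move towards the former.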


Note that the above algorithm relies on a crucial assumption. For this scheme to work, we need
that each data example in the target (e.g., test time) distribution had nonzero probability of
occurring at training time. If we find a point where $p(\mathbf{x}) > 0$ but $q(\mathbf{x}) = 0$, then the corresponding
importance weight should be infinity.


Label Shift Correction


Assume that we are dealing with a classification task with $k$ categories. Using the same notation as in
Section 4.9.3, $q$ and $p$ are the source distribution (e.g., training time) and target distribution (e.g.,
test time), respectively. Assume that the distribution of labels shifts over time: $q(y) \neq p(y)$, but
the class-conditional distribution stays the same: $q(\mathbf{x} \mid y) = p(\mathbf{x} \mid y)$. If the source distribution
$q(y)$ is “wrong”, we can correct for that according to the following identity in true risk as defined
in (4.9.2):


$$\int\int l(f(\mathbf{x}), y) p(\mathbf{x} \mid y) p(y) \;d\mathbf{x}dy = \int\int l(f(\mathbf{x}), y) q(\mathbf{x} \mid y) q(y) \frac{p(y)}{q(y)} \;d\mathbf{x}dy. \tag{4.9.8}$$


Here, our importance weights will correspond to the label likelihood ratios


$$\beta_i \stackrel{\mathrm{def}}{=} \frac{p(y_i)}{q(y_i)}. \tag{4.9.9}$$




One nice thing about label shift is that if we have a reasonably good model on the source distribu- 
tion, then we can get consistent estimates of these weights without ever having to deal with the 
ambient dimension. In deep learning, the inputs tend to be high-dimensional objects like images, 
while the labels are often simpler objects like categories. 


To estimate the target label distribution, we first take our reasonably good off-the-shelf classifier
(typically trained on the training data) and compute its confusion matrix using the validation set
(also from the training distribution). The confusion matrix, $\mathbf{C}$, is simply a $k \times k$ matrix, where
each column corresponds to the label category (ground truth) and each row corresponds to our
model's predicted category. Each cell's value $c_{ij}$ is the fraction of total predictions on the validation
set where the true label was $j$ and our model predicted $i$.


Now, we cannot calculate the confusion matrix on the target data directly, because we do not get
to see the labels for the examples that we see in the wild, unless we invest in a complex real-time
annotation pipeline. What we can do, however, is average all of our model's predictions at test
time together, yielding the mean model outputs $\mu(\hat{\mathbf{y}}) \in \mathbb{R}^k$, whose $i^\mathrm{th}$ element $\mu(\hat{y}_i)$ is the fraction
of total predictions on the test set where our model predicted $i$.


It turns out that under some mild conditions (if our classifier was reasonably accurate in the first
place, if the target data contain only categories that we have seen before, and if the label shift
assumption holds in the first place, which is the strongest assumption here), we can estimate the test
set label distribution by solving a simple linear system

$$\mathbf{C} p(\mathbf{y}) = \mu(\hat{\mathbf{y}}), \qquad (4.9.10)$$

because as an estimate $\sum_{j=1}^k c_{ij} p(y_j) = \mu(\hat{y}_i)$ holds for all $1 \leq i \leq k$, where $p(y_j)$ is the $j^\mathrm{th}$ element
of the $k$-dimensional label distribution vector $p(\mathbf{y})$. If our classifier is sufficiently accurate to begin
with, then the confusion matrix $\mathbf{C}$ will be invertible, and we get a solution $p(\mathbf{y}) = \mathbf{C}^{-1} \mu(\hat{\mathbf{y}})$.


Because we observe the labels on the source data, it is easy to estimate the distribution $q(y)$. Then
for any training example $i$ with label $y_i$, we can take the ratio of our estimated $p(y_i)/q(y_i)$ to calculate
the weight $\beta_i$, and plug this into weighted empirical risk minimization in (4.9.5).
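
As a small worked example of this recipe, the sketch below solves the linear system in (4.9.10) for two classes and turns the solution into importance weights. All numbers are hypothetical, and `solve_linear_system` is a helper written for this illustration. We also assume that each column of the confusion matrix has been normalized to sum to one, so that $c_{ij}$ estimates the probability of predicting $i$ given true label $j$; this is the form in which the linear system holds.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Hypothetical column-normalized confusion matrix from the validation set:
# C[i][j] estimates P(model predicts i | true label is j)
C = [[0.9, 0.2],
     [0.1, 0.8]]
q_y = [0.5, 0.5]    # label distribution observed on the source data
mu = [0.41, 0.59]   # mean model predictions on the unlabeled target data

p_y = solve_linear_system(C, mu)               # estimated target label distribution
weights = [p / q for p, q in zip(p_y, q_y)]    # beta for each class
```

Here the estimated target label distribution is approximately (0.3, 0.7), so training examples of the second class are upweighted by a factor of about 1.4 and those of the first class are downweighted to about 0.6.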
Concept Shift Correction


Concept shift is much harder to fix in a principled manner. For instance, in a situation where 
suddenly the problem changes from distinguishing cats from dogs to one of distinguishing white 
from black animals, it will be unreasonable to assume that we can do much better than just col- 
lecting new labels and training from scratch. Fortunately, in practice, such extreme shifts are 
rare. Instead, what usually happens is that the task keeps on changing slowly. To make things 
more concrete, here are some examples: 


* In computational advertising, new products are launched, old products become less popular.
  This means that the distribution over ads and their popularity changes gradually and any
  click-through rate predictor needs to change gradually with it.

* Traffic camera lenses degrade gradually due to environmental wear, affecting image quality
  progressively.

* News content changes gradually (i.e., most of the news remains unchanged but new stories
  appear).


In such cases, we can use the same approach that we used for training networks to make them 
adapt to the change in the data. In other words, we use the existing network weights and simply 
perform a few update steps with the new data rather than training from scratch. 
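
To make the idea of continuing from existing weights concrete, here is a toy sketch with a scalar linear model. Everything in it is an assumption chosen for illustration (the helper names, the drift from $y = 2x + 1$ to $y = 2.2x + 1$, the learning rate, and the number of epochs); it only shows that a few update steps from the old parameters can reach a lower loss on the new data than the same few steps from scratch.

```python
def sgd_epochs(w, b, data, lr=0.1, num_epochs=3):
    """Run a few epochs of SGD on squared loss for the model w * x + b."""
    for _ in range(num_epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def mse(w, b, data):
    """Mean squared error of the model w * x + b on the given data."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
# Hypothetical slow drift: the relationship moves from y = 2x + 1 to y = 2.2x + 1
new_data = [(x, 2.2 * x + 1.0) for x in xs]

old_w, old_b = 2.0, 1.0                              # parameters fit before the drift
warm_w, warm_b = sgd_epochs(old_w, old_b, new_data)  # continue from the old weights
cold_w, cold_b = sgd_epochs(0.0, 0.0, new_data)      # same budget, from scratch

warm_loss = mse(warm_w, warm_b, new_data)
cold_loss = mse(cold_w, cold_b, new_data)
```

With the same small update budget, `warm_loss` ends up far below `cold_loss`, because the old parameters already sit close to the new optimum.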


4.9.4 A Taxonomy of Learning Problems 


Armed with knowledge about how to deal with changes in distributions, we can now consider 
some other aspects of machine learning problem formulation. 


Batch Learning 


In batch learning, we have access to training features and labels $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, which we
use to train a model $f(\mathbf{x})$. Later on, we deploy this model to score new data $(\mathbf{x}, y)$ drawn from the
same distribution. This is the default assumption for any of the problems that we discuss here. For
instance, we might train a cat detector based on lots of pictures of cats and dogs. Once we have trained
it, we ship it as part of a smart cat door computer vision system that lets only cats in. This is then
installed in a customer's home and is never updated again (barring extreme circumstances).


Online Learning 


Now imagine that the data $(\mathbf{x}_i, y_i)$ arrives one sample at a time. More specifically, assume that
we first observe $\mathbf{x}_i$, then we need to come up with an estimate $f(\mathbf{x}_i)$ and only once we have done
this, we observe $y_i$ and with it, we receive a reward or incur a loss, given our decision. Many
real problems fall into this category. For example, we need to predict tomorrow's stock price; this
allows us to trade based on that estimate, and at the end of the day we find out whether our estimate
allowed us to make a profit. In other words, in online learning, we have the following cycle where
we are continuously improving our model given new observations.

$$\mathrm{model} ~ f_t \longrightarrow \mathrm{data} ~ \mathbf{x}_t \longrightarrow \mathrm{estimate} ~ f_t(\mathbf{x}_t) \longrightarrow \mathrm{observation} ~ y_t \longrightarrow \mathrm{loss} ~ l(y_t, f_t(\mathbf{x}_t)) \longrightarrow \mathrm{model} ~ f_{t+1} \qquad (4.9.11)$$
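
The cycle in (4.9.11) maps directly onto a loop. The sketch below uses a scalar linear model with squared loss; the function name `online_learning`, the learning rate, and the synthetic stream generated by $y = 3x$ are assumptions made purely for illustration.

```python
def online_learning(stream, lr=0.1):
    """Process (x_t, y_t) pairs one at a time with the scalar model w * x."""
    w = 0.0
    losses = []
    for x_t, y_t in stream:
        y_hat = w * x_t                           # estimate f_t(x_t), made before y_t is seen
        losses.append(0.5 * (y_hat - y_t) ** 2)   # loss l(y_t, f_t(x_t)) once y_t arrives
        w -= lr * (y_hat - y_t) * x_t             # gradient step: model f_t becomes f_{t+1}
    return w, losses


# Hypothetical stream in which the label always equals 3 * x
stream = [(x, 3.0 * x) for x in [1.0, 2.0] * 15]
w, losses = online_learning(stream)
```

Each prediction is made before its label is revealed, so the recorded losses are honest estimates of how the model performed at the time it had to decide.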
Bandits


Bandits are a special case of the problem above. While in most learning problems we have a con- 
tinuously parametrized function f where we want to learn its parameters (e.g., a deep network), 
in a bandit problem we only have a finite number of arms that we can pull, i.e., a finite number 
of actions that we can take. It is not very surprising that for this simpler problem stronger theo- 
retical guarantees in terms of optimality can be obtained. We list it mainly since this problem is 
often (confusingly) treated as if it were a distinct learning setting. 


Control 


In many cases the environment remembers what we did. It does not necessarily do so in an adversarial
manner, but it will remember, and its response will depend on what happened before. For instance, a
coffee boiler controller will observe different temperatures depending on whether it was heating
the boiler previously. PID (proportional-integral-derivative) controller algorithms are a popular
choice there. Likewise, a user's behavior on a news site will depend on what we showed them
previously (e.g., they will read most news only once). Many such algorithms form a model of the
environment in which they act such as to make their decisions appear less random. Recently,
control theory (e.g., PID variants) has also been used to automatically tune hyperparameters to
achieve better disentangling and reconstruction quality, and improve the diversity of generated
text and the reconstruction quality of generated images (Shao et al., 2020).


Reinforcement Learning 


In the more general case of an environment with memory, we may encounter situations where
the environment is trying to cooperate with us (cooperative games, in particular for non-zero-sum
games), or others where the environment will try to win. Chess, Go, Backgammon, and StarCraft
are some of the cases in reinforcement learning. Likewise, we might want to build a good controller
for autonomous cars. The other cars are likely to respond to the autonomous car's driving style in
nontrivial ways, e.g., trying to avoid it, trying to cause an accident, or trying to cooperate with
it.


Considering the Environment 


One key distinction between the different situations above is that the same strategy that might have
worked throughout in the case of a stationary environment might not work throughout when the
environment can adapt. For instance, an arbitrage opportunity discovered by a trader is likely to
disappear once they start exploiting it. The speed and manner at which the environment changes
determine to a large extent the type of algorithms that we can bring to bear. For instance, if we
know that things may only change slowly, we can force any estimate to change only slowly, too. If
we know that the environment might change instantaneously, but only very infrequently, we can
make allowances for that. These types of knowledge are crucial for the aspiring data scientist to
deal with concept shift, i.e., when the problem that they are trying to solve changes over time.
4.9.5 Fairness, Accountability, and Transparency in Machine Learning


Finally, it is important to remember that when you deploy machine learning systems you are not 
merely optimizing a predictive model—you are typically providing a tool that will be used to (par- 
tially or fully) automate decisions. These technical systems can impact the lives of individuals 
subject to the resulting decisions. The leap from considering predictions to decisions raises not 
only new technical questions, but also a slew of ethical questions that must be carefully consid- 
ered. If we are deploying a medical diagnostic system, we need to know for which populations 
it may work and which it may not. Overlooking foreseeable risks to the welfare of a subpopula- 
tion could cause us to administer inferior care. Moreover, once we contemplate decision-making 
systems, we must step back and reconsider how we evaluate our technology. Among other con- 
sequences of this change of scope, we will find that accuracy is seldom the right measure. For 
instance, when translating predictions into actions, we will often want to take into account the 
potential cost sensitivity of erring in various ways. If one way of misclassifying an image could
be perceived as a racial slight, while misclassification to a different category would be
harmless, then we might want to adjust our thresholds accordingly, accounting for societal
values in designing the decision-making protocol. We also want to be careful about how prediction
systems can lead to feedback loops. For example, consider predictive policing systems, which al- 
locate patrol officers to areas with high forecasted crime. It is easy to see how a worrying pattern 
can emerge: 


1. Neighborhoods with more crime get more patrols. 


2. Consequently, more crimes are discovered in these neighborhoods, entering the training 
data available for future iterations. 


3. Exposed to more positives, the model predicts yet more crime in these neighborhoods. 


4. In the next iteration, the updated model targets the same neighborhood even more heavily 
leading to yet more crimes discovered, etc. 


Often, the various mechanisms by which a model's predictions become coupled to its training data 
are unaccounted for in the modeling process. This can lead to what researchers call runaway feed- 
back loops. Additionally, we want to be careful about whether we are addressing the right problem 
in the first place. Predictive algorithms now play an outsize role in mediating the dissemination of 
information. Should the news that an individual encounters be determined by the set of Facebook 
pages they have Liked? These are just a few among the many pressing ethical dilemmas that you 
might encounter in a career in machine learning. 


Summary 


* In many cases training and test sets do not come from the same distribution. This is called
  distribution shift.

* True risk is the expectation of the loss over the entire population of data drawn from their
  true distribution. However, this entire population is usually unavailable. Empirical risk is
  an average loss over the training data to approximate the true risk. In practice, we perform
  empirical risk minimization.

* Under the corresponding assumptions, covariate and label shift can be detected and corrected
  for at test time. Failure to account for this bias can become problematic at test time.

* In some cases, the environment may remember automated actions and respond in surprising
  ways. We must account for this possibility when building models and continue to monitor
  live systems, open to the possibility that our models and the environment will become
  entangled in unanticipated ways.


Exercises 


1. What could happen when we change the behavior of a search engine? What might the users 
do? What about the advertisers? 


2. Implement a covariate shift detector. Hint: build a classifier. 
3. Implement a covariate shift corrector. 
4. Besides distribution shift, what else could affect how empirical risk approximates true risk? 


Discussions⁷³


4.10 Predicting House Prices on Kaggle 


Now that we have introduced some basic tools for building and training deep networks and reg- 
ularizing them with techniques including weight decay and dropout, we are ready to put all this 
knowledge into practice by participating in a Kaggle competition. The house price prediction 
competition is a great place to start. The data are fairly generic and do not exhibit exotic structure 
that might require specialized models (as audio or video might). This dataset, collected by Bart de 
Cock in 2011 (DeCock, 2011), covers house prices in Ames, IA from the period of 2006-2010. It is 
considerably larger than the famous Boston housing dataset⁷⁴ of Harrison and Rubinfeld (1978),
boasting both more examples and more features. 


In this section, we will walk you through details of data preprocessing, model design, and hyper- 
parameter selection. We hope that through a hands-on approach, you will gain some intuitions 
that will guide you in your career as a data scientist. 


4.10.1 Downloading and Caching Datasets 


Throughout the book, we will train and test models on various downloaded datasets. Here, we 
implement several utility functions to facilitate data downloading. First, we maintain a dictionary 
DATA_HUB that maps a string (the name of the dataset) to a tuple containing both the URL to locate 
the dataset and the SHA-1 key that verifies the integrity of the file. All such datasets are hosted at 
the site whose address is DATA_URL. 


import os
import requests
import zipfile
import tarfile
import hashlib

#@save
DATA_HUB = dict()
DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'





73 https://discuss.d2l.ai/t/105
74 https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names 
The following download function downloads a dataset, caching it in a local directory (../data by
default) and returns the name of the downloaded file. If a file corresponding to this dataset already
exists in the cache directory and its SHA-1 matches the one stored in DATA_HUB, our code will use
the cached file to avoid clogging up your internet with redundant downloads.

def download(name, cache_dir=os.path.join('..', 'data')):  #@save
    """Download a file inserted into DATA_HUB, return the local filename."""
    assert name in DATA_HUB, f"{name} does not exist in {DATA_HUB}."
    url, sha1_hash = DATA_HUB[name]
    os.makedirs(cache_dir, exist_ok=True)
    fname = os.path.join(cache_dir, url.split('/')[-1])
    if os.path.exists(fname):
        sha1 = hashlib.sha1()
        with open(fname, 'rb') as f:
            while True:
                data = f.read(1048576)
                if not data:
                    break
                sha1.update(data)
        if sha1.hexdigest() == sha1_hash:
            return fname  # Hit cache
    print(f'Downloading {fname} from {url}...')
    r = requests.get(url, stream=True, verify=True)
    with open(fname, 'wb') as f:
        f.write(r.content)
    return fname
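
The cache check above hinges on hashing the file in fixed-size chunks. The following standalone sketch isolates that step and confirms that the chunked digest agrees with hashing all bytes at once. The helper name `sha1_of_file`, the payload, and the deliberately tiny chunk size are choices made for this illustration only.

```python
import hashlib
import os
import tempfile


def sha1_of_file(fname, chunk_size=1048576):
    """Return the SHA-1 hex digest of a file, reading it chunk by chunk."""
    sha1 = hashlib.sha1()
    with open(fname, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            sha1.update(data)
    return sha1.hexdigest()


# Write a small temporary file, then compare against hashing the bytes directly
payload = b'hello, d2l' * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
digest_chunked = sha1_of_file(tmp.name, chunk_size=64)
digest_direct = hashlib.sha1(payload).hexdigest()
os.remove(tmp.name)
```

Reading in chunks keeps memory use bounded regardless of file size, which matters for the larger datasets used later in the book.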


We also implement two additional utility functions: one is to download and extract a zip or tar 
file and the other to download all the datasets used in this book from DATA_HUB into the cache 
directory. 


def download_extract(name, folder=None):  #@save
    """Download and extract a zip/tar file."""
    fname = download(name)
    base_dir = os.path.dirname(fname)
    data_dir, ext = os.path.splitext(fname)
    if ext == '.zip':
        fp = zipfile.ZipFile(fname, 'r')
    elif ext in ('.tar', '.gz'):
        fp = tarfile.open(fname, 'r')
    else:
        assert False, 'Only zip/tar files can be extracted.'
    fp.extractall(base_dir)
    return os.path.join(base_dir, folder) if folder else data_dir


def download_all():  #@save
    """Download all files in the DATA_HUB."""
    for name in DATA_HUB:
        download(name)
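
One detail of `download_extract` worth checking is how `os.path.splitext` treats compound extensions, since its first return value becomes the default return path. The file names below are hypothetical.

```python
import os

# Hypothetical cached archive names
names = ['../data/archive.zip', '../data/archive.tar', '../data/archive.tar.gz']
splits = {name: os.path.splitext(name) for name in names}

for name, (data_dir, ext) in splits.items():
    print(f'{name} -> data_dir={data_dir}, ext={ext}')
```

Only the last suffix is split off, so for a `.tar.gz` archive the default return value still ends in `.tar`. In that case it is safer to pass the `folder` argument explicitly.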
4.10.2 Kaggle


Kaggle⁷⁵ is a popular platform that hosts machine learning competitions. Each competition
centers on a dataset and many are sponsored by stakeholders who offer prizes to the winning
solutions. The platform helps users to interact via forums and shared code, fostering both
collaboration and competition. While leaderboard chasing often spirals out of control, with researchers
focusing myopically on preprocessing steps rather than asking fundamental questions, there is
also tremendous value in the objectivity of a platform that facilitates direct quantitative
comparisons among competing approaches as well as code sharing so that everyone can learn what did
and did not work. If you want to participate in a Kaggle competition, you will first need to register
for an account (see Fig. 4.10.1).


Fig. 4.10.1: The Kaggle website.


On the house price prediction competition page, as illustrated in Fig. 4.10.2, you can find the
dataset (under the "Data" tab), submit predictions, and see your ranking. The URL is right here:


https://www.kaggle.com/c/house-prices-advanced-regression-techniques 


Fig. 4.10.2: The house price prediction competition page.





75 https://www.kaggle.com 
4.10.3 Accessing and Reading the Dataset


Note that the competition data is separated into training and test sets. Each record includes the
property value of the house and attributes such as street type, year of construction, roof type,
basement condition, etc. The features consist of various data types. For example, the year of
construction is represented by an integer, the roof type by discrete categorical assignments, and
other features by floating point numbers. And here is where reality complicates things: for some
examples, some data are altogether missing with the missing value marked simply as "na". The
price of each house is included for the training set only (it is a competition after all). We will want
to partition the training set to create a validation set, but we only get to evaluate our models on
the official test set after uploading predictions to Kaggle. The "Data" tab on the competition page in
Fig. 4.10.2 has links to download the data.


To get started, we will read in and process the data using pandas, which we have introduced in 
Section 2.2. So, you will want to make sure that you have pandas installed before proceeding fur- 
ther. Fortunately, if you are reading in Jupyter, we can install pandas without even leaving the 
notebook. 


# If pandas is not installed, please uncomment the following line: 
# !pip install pandas 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon, autograd, init, np, npx
from mxnet.gluon import nn
import pandas as pd

npx.set_np()


For convenience, we can download and cache the Kaggle housing dataset using the script we de- 
fined above. 


DATA_HUB['kaggle_house_train'] = (  #@save
    DATA_URL + 'kaggle_house_pred_train.csv',
    '585e9cc93e70b39160e7921475f9bcd7d31219ce')

DATA_HUB['kaggle_house_test'] = (  #@save
    DATA_URL + 'kaggle_house_pred_test.csv',
    'fa1978027b011d9b009e8bff8e99922a8ee2eb90')


We use pandas to load the two csv files containing training and test data respectively. 


train_data = pd.read_csv(download('kaggle_house_train'))
test_data = pd.read_csv(download('kaggle_house_test'))


Downloading ../data/kaggle_house_pred_train.csv from http://d2l-data.s3-accelerate.amazonaws.com/kaggle_house_pred_train.csv...
Downloading ../data/kaggle_house_pred_test.csv from http://d2l-data.s3-accelerate.amazonaws.com/kaggle_house_pred_test.csv...


The training dataset includes 1460 examples, 80 features, and 1 label, while the test data contains 
1459 examples and 80 features. 

print(train_data.shape)
print(test_data.shape) 


(1460, 81) 
(1459, 80) 


Let us take a look at the first four and last two features as well as the label (SalePrice) from the first 
four examples. 


print(train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]]) 


   Id  MSSubClass MSZoning  LotFrontage SaleType SaleCondition  SalePrice
0   1          60       RL         65.0       WD        Normal     208500
1   2          20       RL         80.0       WD        Normal     181500
2   3          60       RL         68.0       WD        Normal     223500
3   4          70       RL         60.0       WD       Abnorml     140000


We can see that in each example, the first feature is the ID. This helps the model identify each 
training example. While this is convenient, it does not carry any information for prediction pur- 
poses. Hence, we remove it from the dataset before feeding the data into the model. 


all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:])) 


4.10.4 Data Preprocessing 


As stated above, we have a wide variety of data types. We will need to preprocess the data before we 
can start modeling. Let us start with the numerical features. First, we apply a heuristic, replacing 
all missing values by the corresponding feature’s mean. Then, to put all features on a common 
scale, we standardize the data by rescaling features to zero mean and unit variance: 


$$x \leftarrow \frac{x - \mu}{\sigma}, \qquad (4.10.1)$$

where $\mu$ and $\sigma$ denote the feature's mean and standard deviation, respectively.
To verify that this indeed transforms our feature (variable) such that it has zero mean and unit
variance, note that $E[\frac{x-\mu}{\sigma}] = \frac{\mu - \mu}{\sigma} = 0$ and that $E[(x-\mu)^2] = (\sigma^2 + \mu^2) - 2\mu^2 + \mu^2 = \sigma^2$. Intuitively,
we standardize the data for two reasons. First, it proves convenient for optimization. Second,
because we do not know a priori which features will be relevant, we do not want to penalize
coefficients assigned to one feature more than on any other.


numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# After standardizing the data all means vanish, hence we can set missing
# values to 0
all_features[numeric_features] = all_features[numeric_features].fillna(0)
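
To see what these two steps do to a single column, here is a dependency-free sketch on made-up numbers, with `None` standing in for a missing entry. It computes the mean and the sample standard deviation over the observed entries only, which is our reading of what pandas does by default (missing values are skipped and `std` uses an $n - 1$ denominator), and then sets the missing entry to 0.

```python
import statistics

raw = [65.0, 80.0, None, 60.0, 95.0]   # hypothetical column with one missing value

observed = [v for v in raw if v is not None]
mu = statistics.mean(observed)         # 75.0
sigma = statistics.stdev(observed)     # sample standard deviation (n - 1 denominator)

# Standardize observed entries; a missing entry becomes 0, i.e., the new mean
standardized = [0.0 if v is None else (v - mu) / sigma for v in raw]
```

After the transformation the observed entries have zero mean and unit sample variance, so filling the gap with 0 amounts to imputing the feature's mean.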


Next we deal with discrete values. This includes features such as "MSZoning". We replace them by
a one-hot encoding in the same way that we previously transformed multiclass labels into vectors
(see Section 3.4.1). For instance, "MSZoning" assumes the values "RL" and "RM". Dropping the
"MSZoning" feature, two new indicator features "MSZoning_RL" and "MSZoning_RM" are created
with values being either 0 or 1. According to one-hot encoding, if the original value of "MSZoning"
is "RL", then "MSZoning_RL" is 1 and "MSZoning_RM" is 0. The pandas package does this
automatically for us.


# `dummy_na=True` considers "na" (missing value) as a valid feature value, and
# creates an indicator feature for it
all_features = pd.get_dummies(all_features, dummy_na=True)
all_features.shape


(2919, 331)
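
For intuition about what the encoding produces, the following plain-Python sketch mimics it for a single categorical column. The helper `one_hot`, the use of `None` for a missing value, and the column name chosen for the missing-value indicator are assumptions of this sketch rather than a description of pandas internals.

```python
def one_hot(values, prefix):
    """Map a list of categories (None = missing) to 0/1 indicator columns."""
    categories = sorted({v for v in values if v is not None})
    columns = {f'{prefix}_{c}': [int(v == c) for v in values]
               for c in categories}
    # Extra indicator for missing entries, in the spirit of dummy_na=True
    columns[f'{prefix}_nan'] = [int(v is None) for v in values]
    return columns


encoded = one_hot(['RL', 'RM', None, 'RL'], prefix='MSZoning')
for name, column in encoded.items():
    print(name, column)
```

Every row has exactly one indicator set to 1, which is why a column with many distinct values contributes many new features.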


You can see that this conversion increases the number of features from 79 to 331. Finally, via the 
values attribute, we can extract the NumPy format from the pandas format and convert it into the 
tensor representation for training. 


n_train = train_data.shape[0]
train_features = np.array(all_features[:n_train].values, dtype=np.float32)
test_features = np.array(all_features[n_train:].values, dtype=np.float32)
train_labels = np.array(
    train_data.SalePrice.values.reshape(-1, 1), dtype=np.float32)


4.10.5 Training 


To get started we train a linear model with squared loss. Not surprisingly, our linear model will 
not lead to a competition-winning submission but it provides a sanity check to see whether there 
is meaningful information in the data. If we cannot do better than random guessing here, then 
there might be a good chance that we have a data processing bug. And if things work, the linear 
model will serve as a baseline giving us some intuition about how close the simple model gets 
to the best reported models, giving us a sense of how much gain we should expect from fancier 
models. 


loss = gluon.loss.L2Loss()

def get_net():
    net = nn.Sequential()
    net.add(nn.Dense(1))
    net.initialize()
    return net


With house prices, as with stock prices, we care about relative quantities more than absolute
quantities. Thus we tend to care more about the relative error $\frac{y - \hat{y}}{y}$ than about the absolute error $y - \hat{y}$.
For instance, if our prediction is off by USD 100,000 when estimating the price of a house in Rural
Ohio, where the value of a typical house is 125,000 USD, then we are probably doing a horrible job.
On the other hand, if we err by this amount in Los Altos Hills, California, this might represent a
stunningly accurate prediction (there, the median house price exceeds 4 million USD).


One way to address this problem is to measure the discrepancy in the logarithm of the price
estimates. In fact, this is also the official error measure used by the competition to evaluate the quality
of submissions. After all, a small value $\delta$ for $|\log y - \log \hat{y}| \leq \delta$ translates into $e^{-\delta} \leq \frac{\hat{y}}{y} \leq e^\delta$. This
leads to the following root-mean-squared-error between the logarithm of the predicted price and
the logarithm of the label price:

$$\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log y_i - \log \hat{y}_i\right)^2}. \qquad (4.10.2)$$


def log_rmse(net, features, labels):
    # To further stabilize the value when the logarithm is taken, set the
    # value less than 1 as 1
    clipped_preds = np.clip(net(features), 1, float('inf'))
    # L2Loss carries a factor of 1/2, hence the multiplication by 2
    return np.sqrt(2 * loss(np.log(clipped_preds), np.log(labels)).mean())
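
The following framework-free sketch of (4.10.2) illustrates why this measure captures relative rather than absolute error. The function name `log_rmse_plain` and the prices are made up for the example. A 10% overestimate yields the same value whether the house costs 125,000 USD or 4 million USD.

```python
import math


def log_rmse_plain(preds, labels):
    """Root-mean-squared error between log predictions and log labels."""
    # Clip predictions below 1 to 1 before taking the logarithm
    sq = [(math.log(max(p, 1.0)) - math.log(y)) ** 2
          for p, y in zip(preds, labels)]
    return math.sqrt(sum(sq) / len(sq))


# Overestimating by 10% gives the same error at any price level
cheap = log_rmse_plain([137500.0], [125000.0])
expensive = log_rmse_plain([4400000.0], [4000000.0])
```

Both values equal $\log 1.1 \approx 0.095$, even though the absolute errors differ by a factor of 32.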


Unlike in previous sections, our training functions will rely on the Adam optimizer (we will de- 
scribe it in greater detail later). The main appeal of this optimizer is that, despite doing no better 
(and sometimes worse) given unlimited resources for hyperparameter optimization, people tend 
to find that it is significantly less sensitive to the initial learning rate. 


def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # The Adam optimization algorithm is used here
    trainer = gluon.Trainer(net.collect_params(), 'adam', {
        'learning_rate': learning_rate, 'wd': weight_decay})
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(batch_size)
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls


4.10.6 K-Fold Cross-Validation 


You might recall that we introduced K-fold cross-validation in the section where we discussed how
to deal with model selection (Section 4.4). We will put this to good use to select the model design
and to adjust the hyperparameters. We first need a function that returns the $i^\mathrm{th}$ fold of the data in
a K-fold cross-validation procedure. It proceeds by slicing out the $i^\mathrm{th}$ segment as validation data
and returning the rest as training data. Note that this is not the most efficient way of handling data
and we would definitely do something much smarter if our dataset were considerably larger. But
this added complexity might obfuscate our code unnecessarily so we can safely omit it here owing
to the simplicity of our problem.


def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = np.concatenate([X_train, X_part], 0)
            y_train = np.concatenate([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid
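
To check the slicing logic without any arrays, the sketch below reproduces it on plain index lists. The helper `k_fold_indices` is hypothetical and exists only for this illustration.

```python
def k_fold_indices(n, k, i):
    """Return (train_idx, valid_idx) for fold i, sliced as in get_k_fold_data."""
    assert k > 1
    fold_size = n // k
    train_idx, valid_idx = [], []
    for j in range(k):
        part = list(range(j * fold_size, (j + 1) * fold_size))
        if j == i:
            valid_idx = part
        else:
            train_idx.extend(part)
    return train_idx, valid_idx


train_idx, valid_idx = k_fold_indices(n=10, k=3, i=1)
print(train_idx)   # [0, 1, 2, 6, 7, 8]
print(valid_idx)   # [3, 4, 5]
```

Note that with integer division any remainder is left out: here index 9 appears in neither set. With the 1460 training examples and K = 5 used in this section nothing is dropped, since 1460 is divisible by 5.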


The training and validation error averages are returned when we train K times in the K-fold
cross-validation.


def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
           batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        if i == 0:
            d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
                     xlabel='epoch', ylabel='rmse', xlim=[1, num_epochs],
                     legend=['train', 'valid'], yscale='log')
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k


4.10.7 Model Selection 


In this example, we pick an untuned set of hyperparameters and leave it up to the reader to im- 
prove the model. Finding a good choice can take time, depending on how many variables one 
optimizes over. With a large enough dataset, and the normal sorts of hyperparameters, K-fold 
cross-validation tends to be reasonably resilient against multiple testing. However, if we try an 
unreasonably large number of options we might just get lucky and find that our validation perfor- 
mance is no longer representative of the true error. 


k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                          weight_decay, batch_size)
print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
      f'avg valid log rmse: {float(valid_l):f}')


fold 1, train log rmse 0.169755, valid log rmse 0.157162
fold 2, train log rmse 0.162392, valid log rmse 0.188604
fold 3, train log rmse 0.163703, valid log rmse 0.167751
fold 4, train log rmse 0.167760, valid log rmse 0.154765
fold 5, train log rmse 0.162481, valid log rmse 0.182729
5-fold validation: avg train log rmse: 0.165218, avg valid log rmse: 0.170202

[Figure: train and valid rmse versus epoch (1 to 100), log-scale y-axis.]

Notice that sometimes the training error for a set of hyperparameters can be very
low, even as the error on K-fold cross-validation is considerably higher. This indicates
that we are overfitting. Throughout training you will want to monitor both numbers. Less
overfitting might indicate that our data can support a more powerful model. Massive overfitting
might suggest that we can gain by incorporating regularization techniques.
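
If you want to explore more than one setting, a small grid search is a natural first step. The sketch below is generic and makes several assumptions: `grid_search` is a hypothetical helper, and `toy_validation_error` is a made-up stand-in for the average validation log rmse that a call to `k_fold` would return, so that the example runs on its own. Keep the grid modest, since trying very many options makes the best validation score less representative of the true error.

```python
import itertools


def grid_search(evaluate, grid):
    """Return (best_config, best_score) minimizing evaluate(**config)."""
    names = list(grid)
    best_config, best_score = None, float('inf')
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(**config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_error(lr, weight_decay):
    # Stand-in for the average validation log rmse returned by k_fold;
    # this made-up bowl has its minimum at lr=5, weight_decay=0.1
    return (lr - 5) ** 2 + (weight_decay - 0.1) ** 2


grid = {'lr': [1, 5, 10], 'weight_decay': [0, 0.1, 1]}
best_config, best_score = grid_search(toy_validation_error, grid)
```

In practice, `evaluate` would wrap a call to `k_fold` with the remaining arguments held fixed and return its second output.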


4.10.8 Submitting Predictions on Kaggle 


Now that we know what a good choice of hyperparameters should be, we might as well use all the 
data to train on it (rather than just 1 — 1/K of the data that are used in the cross-validation slices). 
The model that we obtain in this way can then be applied to the test set. Saving the predictions in 
a csv file will simplify uploading the results to Kaggle. 


def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    d2l.plot(np.arange(1, num_epochs + 1), [train_ls], xlabel='epoch',
             ylabel='log rmse', xlim=[1, num_epochs], yscale='log')
    print(f'train log rmse {float(train_ls[-1]):f}')
    # Apply the network to the test set
    preds = net(test_features).asnumpy()
    # Reformat it to export to Kaggle
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv('submission.csv', index=False)


One nice sanity check is to see whether the predictions on the test set resemble those of the K-fold
cross-validation process. If they do, it is time to upload them to Kaggle. The following code will
generate a file called submission.csv.


train_and_pred(train_features, test_features, train_labels, test_data, 
num_epochs, lr, weight_decay, batch_size) 


train log rmse 0.162379 


[Figure: training log rmse (log scale) versus epoch, epochs 1 to 100]


Next, as demonstrated in Fig. 4.10.3, we can submit our predictions on Kaggle and see how they 
compare with the actual house prices (labels) on the test set. The steps are quite simple: 


* Log in to the Kaggle website and visit the house price prediction competition page. 


* Click the “Submit Predictions” or “Late Submission” button (as of this writing, the button is 
located on the right). 


* Click the “Upload Submission File” button in the dashed box at the bottom of the page and
select the prediction file you wish to upload.


* Click the “Make Submission” button at the bottom of the page to view your results. 


[Screenshot: Kaggle's upload form. Step 1: Upload Submission File (CSV format with a header row
and 1459 prediction rows; zip/gz/rar/7z archives are accepted). Step 2: Make Submission.]


Fig. 4.10.3: Submitting data to Kaggle


Summary


* Real data often contain a mix of different data types and need to be preprocessed.

* Rescaling real-valued data to zero mean and unit variance is a good default. So is replacing
missing values with their mean.

* Transforming categorical features into indicator features allows us to treat them like one-hot
vectors.

* We can use K-fold cross-validation to select the model and adjust the hyperparameters.

* Logarithms are useful for relative errors.
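
To make the last point concrete, here is a minimal sketch of a log rmse computed in plain Python. The function `log_rmse` below is a standalone illustration written for this note, not the implementation used earlier in the section, and the prices are made-up values. It shows that a 10% overestimate is penalized equally for a cheap and an expensive house.

```python
import math


def log_rmse(preds, labels):
    """Root-mean-squared error between log predictions and log labels."""
    squared = [(math.log(p) - math.log(y)) ** 2
               for p, y in zip(preds, labels)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up prices: both predictions are 10% above the true price.
cheap = log_rmse([110_000.0], [100_000.0])
expensive = log_rmse([1_100_000.0], [1_000_000.0])
print(cheap, expensive)
```

Both values equal log(1.1) up to floating-point rounding, because the difference of logarithms depends only on the ratio of prediction to label.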


Exercises 
1. Submit your predictions for this section to Kaggle. How good are your predictions? 
2. Can you improve your model by minimizing the logarithm of prices directly? What happens 
if you try to predict the logarithm of the price rather than the price? 
3. Is it always a good idea to replace missing values by their mean? Hint: can you construct a 
situation where the values are not missing at random? 
4. Improve the score on Kaggle by tuning the hyperparameters through K-fold cross- 
validation. 
5. Improve the score by improving the model (e.g., layers, weight decay, and dropout). 
6. What happens if we do not standardize the continuous numerical features as we have
done in this section?
Discussions [76]





76 https://discuss.d2l.ai/t/106


5 Deep Learning Computation


Alongside giant datasets and powerful hardware, great software tools have played an indispensable
role in the rapid progress of deep learning. Starting with the pathbreaking Theano library
released in 2007, flexible open-source tools have enabled researchers to rapidly prototype models,
avoiding repetitive work when recycling standard components while still maintaining the ability
to make low-level modifications. Over time, deep learning libraries have evolved to offer
increasingly coarse abstractions. Just as semiconductor designers went from specifying transistors
to logical circuits to writing code, neural network researchers have moved from thinking about
the behavior of individual artificial neurons to conceiving of networks in terms of whole layers,
and now often design architectures with far coarser blocks in mind.


So far, we have introduced some basic machine learning concepts, ramping up to fully-functional 
deep learning models. In the last chapter, we implemented each component of an MLP from 
scratch and even showed how to leverage high-level APIs to roll out the same models effortlessly. 
To get you that far that fast, we called upon the libraries, but skipped over more advanced details 
about how they work. In this chapter, we will peel back the curtain, digging deeper into the key 
components of deep learning computation, namely model construction, parameter access and 
initialization, designing custom layers and blocks, reading and writing models to disk, and lever- 
aging GPUs to achieve dramatic speedups. These insights will move you from end user to power 
user, giving you the tools needed to reap the benefits of a mature deep learning library while re- 
taining the flexibility to implement more complex models, including those you invent yourself! 
While this chapter does not introduce any new models or datasets, the advanced modeling chap- 
ters that follow rely heavily on these techniques. 


5.1 Layers and Blocks 


When we first introduced neural networks, we focused on linear models with a single output. 
Here, the entire model consists of just a single neuron. Note that a single neuron (i) takes some 
set of inputs; (ii) generates a corresponding scalar output; and (iii) has a set of associated param- 
eters that can be updated to optimize some objective function of interest. Then, once we started 
thinking about networks with multiple outputs, we leveraged vectorized arithmetic to characterize 
an entire layer of neurons. Just like individual neurons, layers (i) take a set of inputs, (ii) generate 
corresponding outputs, and (iii) are described by a set of tunable parameters. When we worked 
through softmax regression, a single layer was itself the model. However, even when we subse- 
quently introduced MLPs, we could still think of the model as retaining this same basic structure. 


Interestingly, for MLPs, both the entire model and its constituent layers share this structure. The
entire model takes in raw inputs (the features), generates outputs (the predictions), and possesses
parameters (the combined parameters from all constituent layers). Likewise, each individual layer
ingests inputs (supplied by the previous layer), generates outputs (the inputs to the subsequent
layer), and possesses a set of tunable parameters that are updated according to the signal that
flows backwards from the subsequent layer.


While you might think that neurons, layers, and models give us enough abstractions to go about 
our business, it turns out that we often find it convenient to speak about components that are 
larger than an individual layer but smaller than the entire model. For example, the ResNet-152 
architecture, which is wildly popular in computer vision, possesses hundreds of layers. These 
layers consist of repeating patterns of groups of layers. Implementing such a network one layer at 
a time can grow tedious. This concern is not just hypothetical—such design patterns are common 
in practice. The ResNet architecture mentioned above won the 2015 ImageNet and COCO com- 
puter vision competitions for both recognition and detection (He et al., 2016a) and remains a go-to 
architecture for many vision tasks. Similar architectures in which layers are arranged in various 
repeating patterns are now ubiquitous in other domains, including natural language processing 
and speech. 


To implement these complex networks, we introduce the concept of a neural network block. A
block could describe a single layer, a component consisting of multiple layers, or the entire model
itself! One benefit of working with the block abstraction is that blocks can be combined into larger
artifacts, often recursively. This is illustrated in Fig. 5.1.1. By defining code to generate blocks
of arbitrary complexity on demand, we can write surprisingly compact code and still implement
complex neural networks.





Fig. 5.1.1: Multiple layers are combined into blocks, forming repeating patterns of larger models. 


From a programming standpoint, a block is represented by a class. Any subclass of it must define a
forward propagation function that transforms its input into output and must store any necessary
parameters. Note that some blocks do not require any parameters at all. Finally, a block must
possess a backpropagation function, for purposes of calculating gradients. Fortunately, due to some
behind-the-scenes magic supplied by automatic differentiation (introduced in Section 2.5), when
defining our own block we only need to worry about parameters and the forward propagation
function.


To begin, we revisit the code that we used to implement MLPs (Section 4.3). The following code 
generates a network with one fully-connected hidden layer with 256 units and ReLU activation, 
followed by a fully-connected output layer with 10 units (no activation function). 


from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()

net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize()

X = np.random.uniform(size=(2, 20))
net(X)


array([[ 0.06240274, -0.03268593, 0.02582653, 0.02254181, -0.03728798, 
-0.04253785, 0.00540612, -0.01364185, -0.09915454, -0.02272737], 
[ 0.02816679, -0.03341204, 0.03565665, 0.02506384, -0.04136416, 
-0.04941844, 0.01738529, 0.01081963, -0.09932579, -0.01176296]]) 


In this example, we constructed our model by instantiating an nn.Sequential, assigning the
returned object to the net variable. Next, we repeatedly called its add function, appending layers in
the order that they should be executed. In short, nn.Sequential defines a special kind of Block,
the class that represents a block in Gluon. It maintains an ordered list of constituent Blocks. The
add function simply facilitates the addition of each successive Block to the list. Note that each
layer is an instance of the Dense class, which is itself a subclass of Block. The forward propagation
(forward) function is also remarkably simple: it chains each Block in the list together, passing the
output of each as the input to the next. Note that until now, we have been invoking our models via
the construction net(X) to obtain their outputs. This is actually just shorthand for net.forward(X),
a slick Python trick achieved via the Block class's __call__ function.
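
The `__call__` shorthand can be illustrated without any deep learning library. The sketch below uses two hypothetical classes, `TinyBlock` and `Double`, invented for this note; it is not Gluon's implementation, which performs additional bookkeeping before dispatching to `forward`.

```python
class TinyBlock:
    """Hypothetical stand-in for a block; not Gluon's implementation."""

    def __call__(self, *args):
        # Calling the object like a function dispatches to `forward`.
        # A real framework may run extra bookkeeping here first.
        return self.forward(*args)

    def forward(self, *args):
        raise NotImplementedError


class Double(TinyBlock):
    """Toy subclass whose forward pass doubles its input."""

    def forward(self, x):
        return 2 * x


layer = Double()
print(layer(3), layer.forward(3))
```

Both calls produce the same value because `layer(3)` is routed through `__call__` to `forward`.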


5.1.1 A Custom Block 


Perhaps the easiest way to develop intuition about how a block works is to implement one our- 
selves. Before we implement our own custom block, we briefly summarize the basic functionality 
that each block must provide: 


1. Ingest input data as arguments to its forward propagation function. 


2. Generate an output by having the forward propagation function return a value. Note that 
the output may have a different shape from the input. For example, the first fully-connected 
layer in our model above ingests an input of arbitrary dimension but returns an output of 
dimension 256. 


3. Calculate the gradient of its output with respect to its input, which can be accessed via its 
backpropagation function. Typically this happens automatically. 


4. Store and provide access to those parameters necessary to execute the forward propagation 
computation. 


5. Initialize model parameters as needed. 


In the following snippet, we code up a block from scratch corresponding to an MLP with one hid- 
den layer with 256 hidden units, and a 10-dimensional output layer. Note that the MLP class below 
inherits the class that represents a block. We will heavily rely on the parent class's functions, sup- 
plying only our own constructor (the __init__ function in Python) and the forward propagation 
function. 


class MLP(nn.Block):
    # Declare a layer with model parameters. Here, we declare two
    # fully-connected layers
    def __init__(self, **kwargs):
        # Call the constructor of the `MLP` parent class `Block` to perform
        # the necessary initialization. In this way, other function arguments
        # can also be specified during class instantiation, such as the model
        # parameters, `params` (to be described later)
        super().__init__(**kwargs)
        self.hidden = nn.Dense(256, activation='relu')  # Hidden layer
        self.out = nn.Dense(10)  # Output layer

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input `X`
    def forward(self, X):
        return self.out(self.hidden(X))


Let us first focus on the forward propagation function. Note that it takes X as the input, calculates 
the hidden representation with the activation function applied, and outputs its logits. In this MLP 
implementation, both layers are instance variables. To see why this is reasonable, imagine instan- 
tiating two MLPs, net1 and net2, and training them on different data. Naturally, we would expect 
them to represent two different learned models. 


We instantiate the MLP's layers in the constructor and subsequently invoke these layers on each
call to the forward propagation function. Note a few key details. First, our customized __init__
function invokes the parent class's __init__ function via super().__init__(), sparing us the
pain of restating boilerplate code applicable to most blocks. We then instantiate our two
fully-connected layers, assigning them to self.hidden and self.out. Note that unless we implement a
new operator, we need not worry about the backpropagation function or parameter initialization.
The system will generate these functions automatically. Let us try this out.


net = MLP() 
net.initialize() 
net(X) 


array([[-0.03989595, -0.10414709, 0.06799038, 0.05245074, 0.0252606 , 
-0.00640342, 0.04182098, -0.01665318, -0.02067345, -0.07863816], 
[-0.03612847, -0.07210435, 0.09159479, 0.07890773, 0.02494171, 
-0.01028665, 0.01732427, -0.02843244, 0.03772651, -0.06671703]]) 


A key virtue of the block abstraction is its versatility. We can subclass a block to create layers (such
as the fully-connected layer class), entire models (such as the MLP class above), or various
components of intermediate complexity. We exploit this versatility throughout the following chapters,
such as when addressing convolutional neural networks.


5.1.2 The Sequential Block


We can now take a closer look at how the Sequential class works. Recall that Sequential was
designed to daisy-chain other blocks together. To build our own simplified MySequential, we just
need to define two key functions:


1. A function to append blocks one by one to a list.


2. A forward propagation function to pass an input through the chain of blocks, in the same
order as they were appended.


The following MySequential class delivers the same functionality as the default Sequential class.


class MySequential(nn.Block):
    def add(self, block):
        # Here, `block` is an instance of a `Block` subclass, and we assume
        # that it has a unique name. We save it in the member variable
        # `_children` of the `Block` class, and its type is OrderedDict. When
        # the `MySequential` instance calls the `initialize` function, the
        # system automatically initializes all members of `_children`
        self._children[block.name] = block

    def forward(self, X):
        # OrderedDict guarantees that members will be traversed in the order
        # they were added
        for block in self._children.values():
            X = block(X)
        return X


The add function adds a single block to the ordered dictionary _children. You might wonder why
every Gluon Block possesses a _children attribute and why we used it rather than just defining a
Python list ourselves. In short, the chief advantage of _children is that during our block's
parameter initialization, Gluon knows to look inside the _children dictionary to find sub-blocks whose
parameters also need to be initialized.
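
The following library-free sketch illustrates the idea under simplifying assumptions. `ToyBlock` and `ListBlock` are hypothetical classes written for this note: the first registers sub-blocks in an ordered dictionary that `initialize` walks recursively, while the second hides them in a private list that `initialize` never sees. Gluon's actual mechanism is more involved.

```python
from collections import OrderedDict


class ToyBlock:
    """Hypothetical block that registers sub-blocks in an ordered dict."""

    def __init__(self, name):
        self.name = name
        self._children = OrderedDict()
        self.initialized = False

    def add(self, block):
        self._children[block.name] = block

    def initialize(self):
        # Initialize this block, then recurse into every registered child.
        self.initialized = True
        for child in self._children.values():
            child.initialize()


class ListBlock(ToyBlock):
    """Variant that stores sub-blocks in a list `initialize` ignores."""

    def __init__(self, name):
        super().__init__(name)
        self.hidden_list = []

    def add(self, block):
        self.hidden_list.append(block)  # Invisible to `initialize`


good, bad = ToyBlock('good'), ListBlock('bad')
a, b = ToyBlock('a'), ToyBlock('b')
good.add(a)
bad.add(b)
good.initialize()
bad.initialize()
print(a.initialized, b.initialized)
```

The child registered through `_children` is initialized along with its parent, whereas the one kept in the private list is silently skipped.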


When our MySequential's forward propagation function is invoked, the added blocks are executed
in the order in which they were added. We can now reimplement an MLP using our MySequential
class.


net = MySequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize()
net(X)


array([[-0.0764568 , -0.01130233,  0.04952145, -0.04651389, -0.04131571,
        -0.05884131, -0.06213811,  0.01311471, -0.01379425, -0.02514282],
       [-0.05124623,  0.00711232, -0.00155933, -0.07555379, -0.06675334,
        -0.01762914,  0.00589085,  0.0144719 , -0.04330775,  0.03317727]])


Note that this use of MySequential is identical to the code we previously wrote for the Sequential
class (as described in Section 4.3).


5.1.3 Executing Code in the Forward Propagation Function


The Sequential class makes model construction easy, allowing us to assemble new architectures 
without having to define our own class. However, not all architectures are simple daisy chains. 
When greater flexibility is required, we will want to define our own blocks. For example, we 
might want to execute Python's control flow within the forward propagation function. Moreover, 
we might want to perform arbitrary mathematical operations, not simply relying on predefined 
neural network layers. 


You might have noticed that until now, all of the operations in our networks have acted upon our
network's activations and its parameters. Sometimes, however, we might want to incorporate
terms that are neither the result of previous layers nor updatable parameters. We call these
constant parameters. Say for example that we want a layer that calculates the function
$f(\mathbf{x}, \mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$, where $\mathbf{x}$ is the input,
$\mathbf{w}$ is our parameter, and $c$ is some specified constant that is not updated during
optimization. So we implement a FixedHiddenMLP class as follows.


class FixedHiddenMLP(nn.Block):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Random weight parameters created with the `get_constant` function
        # are not updated during training (i.e., constant parameters)
        self.rand_weight = self.params.get_constant(
            'rand_weight', np.random.uniform(size=(20, 20)))
        self.dense = nn.Dense(20, activation='relu')

    def forward(self, X):
        X = self.dense(X)
        # Use the created constant parameters, as well as the `relu` and `dot`
        # functions
        X = npx.relu(np.dot(X, self.rand_weight.data()) + 1)
        # Reuse the fully-connected layer. This is equivalent to sharing
        # parameters with two fully-connected layers
        X = self.dense(X)
        # Control flow
        while np.abs(X).sum() > 1:
            X /= 2
        return X.sum()


In this FixedHiddenMLP model, we implement a hidden layer whose weights (self.rand_weight)
are initialized randomly at instantiation and are thereafter constant. This weight is not a model
parameter and thus it is never updated by backpropagation. The network then passes the output
of this “fixed” layer through a fully-connected layer.


Note that before returning the output, our model did something unusual. We ran a while-loop,
testing whether the $L_1$ norm of the output is larger than 1, and dividing the output vector by 2
until the norm is no larger than 1. Finally, we returned the sum of the entries in X. To our knowledge, no
standard neural network performs this operation. Note that this particular operation may not be
useful in any real-world task. Our point is only to show you how to integrate arbitrary code into
the flow of your neural network computations.
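
The control flow described above can be traced by hand on a small vector. The sketch below is a plain-Python illustration; `halve_until_small` is a hypothetical helper name, and the input values are chosen arbitrarily.

```python
def halve_until_small(values, threshold=1.0):
    """Halve every entry until the L1 norm is at most `threshold`."""
    values = list(values)
    while sum(abs(v) for v in values) > threshold:
        values = [v / 2 for v in values]
    return values


# The L1 norm starts at 6 and is halved three times: 6 -> 3 -> 1.5 -> 0.75.
out = halve_until_small([3.0, -1.0, 2.0])
print(out, sum(out))
```

The number of loop iterations depends on the data, which is exactly the kind of data-dependent control flow that a fixed chain of layers cannot express.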


net = FixedHiddenMLP()
net.initialize()
net(X)


array(0.52637565)


We can mix and match various ways of assembling blocks together. In the following example, we 
nest blocks in some creative ways. 


class NestMLP(nn.Block):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.net = nn.Sequential()
        self.net.add(nn.Dense(64, activation='relu'),
                     nn.Dense(32, activation='relu'))
        self.dense = nn.Dense(16, activation='relu')

    def forward(self, X):
        return self.dense(self.net(X))

chimera = nn.Sequential()
chimera.add(NestMLP(), nn.Dense(20), FixedHiddenMLP())
chimera.initialize()
chimera(X)


array(0.9772054) 


5.1.4 Efficiency 


The avid reader might start to worry about the efficiency of some of these operations. After all,
we have lots of dictionary lookups, code execution, and lots of other Pythonic things taking place
in what is supposed to be a high-performance deep learning library. The problems of Python's
global interpreter lock [77] are well known. In the context of deep learning, we may worry that our
extremely fast GPU(s) might have to wait until a puny CPU runs Python code before it gets another
job to run. The best way to speed up Python is by avoiding it altogether.


One way that Gluon does this is by allowing for hybridization, which will be described later. Here,
the Python interpreter executes a block the first time it is invoked. The Gluon runtime records
what is happening and the next time around it short-circuits calls to Python. This can accelerate
things considerably in some cases, but care needs to be taken when control flow (as above) leads
down different branches on different passes through the net. We recommend that the interested
reader check out the hybridization section (Section 12.1) to learn about compilation after
finishing the current chapter.





77 https://wiki.python.org/moin/GlobalInterpreterLock


Summary


* Layers are blocks.

* Many layers can comprise a block.

* Many blocks can comprise a block.

* A block can contain code.

* Blocks take care of lots of housekeeping, including parameter initialization and
backpropagation.

* Sequential concatenations of layers and blocks are handled by the Sequential block.


Exercises 


1. What kinds of problems will occur if you change MySequential to store blocks in a Python 
list? 


2. Implement a block that takes two blocks as arguments, say net1 and net2, and returns
the concatenated output of both networks in the forward propagation. This is also called a
parallel block.


3. Assume that you want to concatenate multiple instances of the same network. Implement 
a factory function that generates multiple instances of the same block and build a larger 
network from it. 


Discussions [78]


5.2 Parameter Management 


Once we have chosen an architecture and set our hyperparameters, we proceed to the training 
loop, where our goal is to find parameter values that minimize our loss function. After training, we 
will need these parameters in order to make future predictions. Additionally, we will sometimes 
wish to extract the parameters either to reuse them in some other context, to save our model 
to disk so that it may be executed in other software, or for examination in the hope of gaining 
scientific understanding. 


Most of the time, we will be able to ignore the nitty-gritty details of how parameters are declared 
and manipulated, relying on deep learning frameworks to do the heavy lifting. However, when we 
move away from stacked architectures with standard layers, we will sometimes need to get into 
the weeds of declaring and manipulating parameters. In this section, we cover the following: 


* Accessing parameters for debugging, diagnostics, and visualizations.
* Parameter initialization.
* Sharing parameters across different model components.


We start by focusing on an MLP with one hidden layer. 





78 https://discuss.d2l.ai/t/54


from mxnet import init, np, npx
from mxnet.gluon import nn
npx.set_np()

net = nn.Sequential()
net.add(nn.Dense(8, activation='relu'))
net.add(nn.Dense(1))
net.initialize()  # Use the default initialization method

X = np.random.uniform(size=(2, 4))
net(X)  # Forward computation


array([[0.0054572 ], 
[0.00488594]]) 


5.2.1 Parameter Access 


Let us start with how to access parameters from the models that you already know. When a model
is defined via the Sequential class, we can first access any layer by indexing into the model as
though it were a list. Each layer's parameters are conveniently located in its params attribute. We can
inspect the parameters of the second fully-connected layer as follows.


print(net[1].params) 


dense1_ (
  Parameter dense1_weight (shape=(1, 8), dtype=float32)
  Parameter dense1_bias (shape=(1,), dtype=float32)
)


The output tells us a few important things. First, this fully-connected layer contains two parame- 
ters, corresponding to that layer's weights and biases, respectively. Both are stored as single pre- 
cision floats (float32). Note that the names of the parameters allow us to uniquely identify each 
layer's parameters, even in a network containing hundreds of layers. 


Targeted Parameters 


Note that each parameter is represented as an instance of the parameter class. To do anything 
useful with the parameters, we first need to access the underlying numerical values. There are 
several ways to do this. Some are simpler while others are more general. The following code 
extracts the bias from the second neural network layer, which returns a parameter class instance, 
and further accesses that parameter's value. 


print(type(net[1].bias)) 
print(net[1].bias) 
print(net[1].bias.data()) 


<class 'mxnet.gluon.parameter.Parameter'>
Parameter dense1_bias (shape=(1,), dtype=float32)
[0.]


Parameters are complex objects, containing values, gradients, and additional information. That is
why we need to request the value explicitly.
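
To see why a parameter is more than a bare array, consider the toy container below. `ToyParameter` is a hypothetical class written for this note and is far simpler than Gluon's `Parameter`; it only shows how a name, a shape, a value, and a gradient can live in one object with the value and gradient exposed through explicit accessor methods.

```python
class ToyParameter:
    """Hypothetical container; Gluon's `Parameter` holds far more state."""

    def __init__(self, name, shape):
        self.name = name
        self.shape = shape
        size = 1
        for dim in shape:
            size *= dim
        # Flat storage for the value and for its gradient, both zeroed.
        self._data = [0.0] * size
        self._grad = [0.0] * size

    def data(self):
        """Return the underlying numerical values."""
        return self._data

    def grad(self):
        """Return the gradient buffer (zeros until backpropagation runs)."""
        return self._grad

    def __repr__(self):
        return f'Parameter {self.name} (shape={self.shape})'


toy_bias = ToyParameter('toy_bias', (1,))
print(toy_bias)
print(toy_bias.data())
```

Printing the object shows its metadata, while the numerical value is only returned when `data()` is called.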


In addition to the value, each parameter also allows us to access the gradient. Because we have 
not invoked backpropagation for this network yet, it is in its initial state. 


net[1].weight.grad() 


array([[0., 0., 0., 0., 0., 0., 0., 0.]])


All Parameters at Once 


When we need to perform operations on all parameters, accessing them one-by-one can grow 
tedious. The situation can grow especially unwieldy when we work with more complex blocks 
(e.g., nested blocks), since we would need to recurse through the entire tree to extract each sub- 
block's parameters. Below we demonstrate accessing the parameters of the first fully-connected 
layer vs. accessing all layers. 


print(net[0].collect_params()) 
print(net.collect_params()) 


dense0_ (
  Parameter dense0_weight (shape=(8, 4), dtype=float32)
  Parameter dense0_bias (shape=(8,), dtype=float32)
)
sequential0_ (
  Parameter dense0_weight (shape=(8, 4), dtype=float32)
  Parameter dense0_bias (shape=(8,), dtype=float32)
  Parameter dense1_weight (shape=(1, 8), dtype=float32)
  Parameter dense1_bias (shape=(1,), dtype=float32)
)

This provides us with another way of accessing the parameters of the network as follows. 


net.collect_params()['dense1_bias'].data()


array([0.])


Collecting Parameters from Nested Blocks 


Let us see how the parameter naming conventions work if we nest multiple blocks inside each 
other. For that we first define a function that produces blocks (a block factory, so to speak) and 
then combine these inside yet larger blocks. 


def block1():
    net = nn.Sequential()
    net.add(nn.Dense(32, activation='relu'))
    net.add(nn.Dense(16, activation='relu'))
    return net

def block2():
    net = nn.Sequential()
    for _ in range(4):
        # Nested here
        net.add(block1())
    return net

rgnet = nn.Sequential()
rgnet.add(block2())
rgnet.add(nn.Dense(10))
rgnet.initialize()
rgnet(X)


array([[-6.3465846e-09, -1.1096752e-09, 6.4161787e-09, 6.6354140e-09,
-1.1265507e-09, 1.3284951e-10, 9.3619388e-09, 3.2229084e-09, 
5.9429879e-09, 8.8181435e-09], 
[-8.6219423e-09, -7.5150686e-10, 8.3133251e-09, 8.9321128e-09, 
-1.6740003e-09, 3.2405989e-10, 1.2115976e-08, 4.4926449e-09, 
8.0741742e-09, 1.2075874e-08]]) 


Now that we have designed the network, let us see how it is organized. 


print(rgnet.collect_params) 
print(rgnet.collect_params()) 


<bound method Block.collect_params of Sequential( 
(0): Sequential ( 
(0): Sequential( 
(0): Dense(4 -> 32, Activation(relu)) 
(1): Dense(32 -> 16, Activation(relu)) 
) 
(1): Sequential( 
(0): Dense(16 -> 32, Activation(relu)) 
(1): Dense(32 -> 16, Activation(relu)) 
) 
(2): Sequential( 
(0): Dense(16 -> 32, Activation(relu)) 
(1): Dense(32 -> 16, Activation(relu)) 
) 
(3): Sequential( 
(0): Dense(16 -> 32, Activation(relu)) 
(1): Dense(32 -> 16, Activation(relu)) 
) 
) 
(1): Dense(16 -> 10, linear) 
)> 
sequential1_ (
  Parameter dense2_weight (shape=(32, 4), dtype=float32)
  Parameter dense2_bias (shape=(32,), dtype=float32)
  Parameter dense3_weight (shape=(16, 32), dtype=float32)
  Parameter dense3_bias (shape=(16,), dtype=float32)
  Parameter dense4_weight (shape=(32, 16), dtype=float32)
  Parameter dense4_bias (shape=(32,), dtype=float32)
  Parameter dense5_weight (shape=(16, 32), dtype=float32)
  Parameter dense5_bias (shape=(16,), dtype=float32)
  Parameter dense6_weight (shape=(32, 16), dtype=float32)
  Parameter dense6_bias (shape=(32,), dtype=float32)
  Parameter dense7_weight (shape=(16, 32), dtype=float32)
  Parameter dense7_bias (shape=(16,), dtype=float32)
  Parameter dense8_weight (shape=(32, 16), dtype=float32)
  Parameter dense8_bias (shape=(32,), dtype=float32)
  Parameter dense9_weight (shape=(16, 32), dtype=float32)
  Parameter dense9_bias (shape=(16,), dtype=float32)
  Parameter dense10_weight (shape=(10, 16), dtype=float32)
  Parameter dense10_bias (shape=(10,), dtype=float32)
)


Since the layers are hierarchically nested, we can also access them as though indexing through
nested lists. For instance, we can access the first major block, within it the second sub-block, and
within that the bias of the first layer, as follows.


rgnet[0][1][0].bias.data()
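
The indexing pattern can be checked against the parameter listing above using ordinary nested lists. In the sketch below, `make_block1` and `toy_rgnet` are hypothetical names that mirror the structure of the network with strings in place of layers; the layer names follow the numbering shown in the printed parameters.

```python
def make_block1(start):
    """Return names of the two layers in one sub-block (illustrative)."""
    return [f'dense{start}', f'dense{start + 1}']


# Four two-layer sub-blocks (dense2 ... dense9), then one output layer.
block2 = [make_block1(2 + 2 * i) for i in range(4)]
toy_rgnet = [block2, 'dense10']

# First major block -> second sub-block -> first layer.
first_layer_of_second_subblock = toy_rgnet[0][1][0]
print(first_layer_of_second_subblock)
```

Under this numbering the selected layer is the one whose weight has shape (32, 16) in the listing, i.e., the first layer of the second sub-block.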


5.2.2 Parameter Initialization 


Now that we know how to access the parameters, let us look at how to initialize them properly.
We discussed the need for proper initialization in Section 4.8. The deep learning framework
provides default random initializations to its layers. However, we often want to initialize our weights
according to various other protocols. The framework provides the most commonly used protocols,
and also allows us to create a custom initializer.


By default, MXNet initializes weight parameters by randomly drawing from a uniform distribution
$U(-0.07, 0.07)$, clearing bias parameters to zero. MXNet's init module provides a variety of preset
initialization methods.
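
As a rough picture of what this default amounts to, the sketch below draws a weight matrix from U(-0.07, 0.07) and sets the biases to zero using only the standard library. The function `default_like_init`, its arguments, and the fixed seed are assumptions made for illustration; this is not MXNet's code.

```python
import random


def default_like_init(num_out, num_in, scale=0.07, seed=0):
    """Sample weights from U(-scale, scale) and set biases to zero."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-scale, scale) for _ in range(num_in)]
              for _ in range(num_out)]
    bias = [0.0] * num_out
    return weight, bias


# Shapes match the first layer of the MLP above: 8 outputs, 4 inputs.
weight, bias = default_like_init(8, 4)
print(len(weight), len(weight[0]), bias)
```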


Built-in Initialization 


Let us begin by calling on built-in initializers. The code below initializes all weight parameters as
Gaussian random variables with standard deviation 0.01, while bias parameters are cleared to zero.


# Here 'force_reinit' ensures that parameters are freshly initialized even if 
# they were already initialized previously 
net.initialize(init=init.Normal(sigma=0.01), force_reinit=True) 
net[0].weight.data()[0] 


array([-0.00324057, -0.00895028, -0.00698632, 0.01030831]) 
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We can also initialize all the parameters to a given constant value (say, 1). 


net.initialize(init=init.Constant(1), force_reinit=True) 
net[0].weight.data()[0]


array([1., 1., 1., 1.])


We can also apply different initializers for certain blocks. For example, below we initialize the 
first layer with the Xavier initializer and initialize the second layer to a constant value of 42. 


net[0].weight.initialize(init=init.Xavier(), force_reinit=True) 
net[1].initialize(init=init.Constant(42), force_reinit=True)
print(net[0].weight.data()[0]) 

print(net[1].weight.data()) 


[-0.17594433 0.02314097 -0.1992535  0.09509248] 
[[42. 42. 42. 42. 42. 42. 42. 42.]]


Custom Initialization 


Sometimes, the initialization methods we need are not provided by the deep learning framework. 
In the example below, we define an initializer for any weight parameter w using the following 
strange distribution: 


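$$
w \sim \begin{cases}
    U(5, 10) & \text{with probability } \frac{1}{4} \\
    0 & \text{with probability } \frac{1}{2} \\
    U(-10, -5) & \text{with probability } \frac{1}{4}
\end{cases}
\qquad (5.2.1)
$$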


Here we define a subclass of the Initializer class. Usually, we only need to implement the 
_init_weight function which takes a tensor argument (data) and assigns to it the desired initial- 
ized values. 


class MyInit(init.Initializer):
    def _init_weight(self, name, data):
        print('Init', name, data.shape)
        # Draw from U(-10, 10), then zero out entries with magnitude below 5
        data[:] = np.random.uniform(-10, 10, data.shape)
        data *= np.abs(data) >= 5


net.initialize(MyInit(), force_reinit=True) 
net[0].weight.data()[:2] 


Init dense0_weight (8, 4) 
Init dense1_weight (1, 8)


array([[ ... ],
       [ 0.       , -8.828651 , -0.       , -5.6012006]])
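As a framework-independent check of (5.2.1), the following sketch uses only Python's standard `random` module to draw weights the same way `MyInit` does: sample from U(-10, 10) and zero out values whose magnitude is below 5. The helper name `sample_weight` is our own and not part of MXNet. If the construction matches (5.2.1), roughly half of the samples should be zero and about a quarter should fall in each tail.

```python
import random


def sample_weight(rng):
    """Draw one weight: U(-10, 10), zeroed when |w| < 5 (hypothetical helper)."""
    w = rng.uniform(-10, 10)
    return w if abs(w) >= 5 else 0.0


rng = random.Random(0)  # fixed seed so that the run is reproducible
samples = [sample_weight(rng) for _ in range(100_000)]

# Empirical frequencies of the three cases in (5.2.1)
frac_zero = sum(w == 0.0 for w in samples) / len(samples)
frac_pos = sum(w >= 5 for w in samples) / len(samples)
frac_neg = sum(w <= -5 for w in samples) / len(samples)
print(frac_zero, frac_pos, frac_neg)  # close to 0.5, 0.25, 0.25
```

The three fractions always sum to one, since every sample is either zero or lies in one of the two tails.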


Note that we always have the option of setting parameters directly.


net[0].weight.data()[:] += 1
net[0].weight.data()[0, 0] = 42 
net[0].weight.data()[0] 


array([42., ...])


A note for advanced users: if you want to adjust parameters within an autograd scope, you need 
to use set_data to avoid confusing the automatic differentiation mechanics. 


5.2.3 Tied Parameters 


Often, we want to share parameters across multiple layers. Let us see how to do this elegantly. 
In the following we allocate a dense layer and then use its parameters specifically to set those of 
another layer. 


net = nn.Sequential()
# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.Dense(8, activation='relu')
net.add(nn.Dense(8, activation='relu'),
        shared,
        nn.Dense(8, activation='relu', params=shared.params),
        nn.Dense(10))
net.initialize()


X = np.random.uniform(size=(2, 20)) 
net(X) 


# Check whether the parameters are the same
print(net[1].weight.data()[0] == net[2].weight.data()[0])
net[1].weight.data()[0, 0] = 100
# Make sure that they are actually the same object rather than just having the
# same value
print(net[1].weight.data()[0] == net[2].weight.data()[0])


[ True True True True True True True True] 
[ True True True True True True True True] 


This example shows that the parameters of the second and third layer are tied. They are not just 
equal, they are represented by the same exact tensor. Thus, if we change one of the parameters, 
the other one changes, too. You might wonder, when parameters are tied what happens to the 
gradients? Since the model parameters contain gradients, the gradients of the second hidden 
layer and the third hidden layer are added together during backpropagation. 
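To see why the gradients add up, consider a scalar toy model (our own illustration, independent of the framework) in which the same weight w is used in two consecutive layers, y = w * (w * x). Treating the two uses as separate weights w1 and w2 gives dy/dw1 = w2 * x and dy/dw2 = w1 * x; with w1 = w2 = w, the derivative with respect to the shared weight is their sum, 2 * w * x. The sketch below checks this against a finite difference.

```python
def forward(w1, w2, x):
    """Two stacked scalar 'layers': the first multiplies by w1, the second by w2."""
    return w2 * (w1 * x)


w, x = 3.0, 2.0

# Per-layer gradients, treating the two uses of the weight as separate
grad_layer1 = w * x  # dy/dw1 = w2 * x, evaluated at w2 = w
grad_layer2 = w * x  # dy/dw2 = w1 * x, evaluated at w1 = w

# With tied weights, the gradient w.r.t. the shared w is the sum of both
grad_tied = grad_layer1 + grad_layer2

# Central finite difference of y(w) = forward(w, w, x) as a sanity check
eps = 1e-6
grad_numeric = (forward(w + eps, w + eps, x)
                - forward(w - eps, w - eps, x)) / (2 * eps)
print(grad_tied, grad_numeric)
```

Both numbers agree up to rounding, which is the scalar version of the statement that the gradients of the tied layers are accumulated.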


Summary


• We have several ways to access, initialize, and tie model parameters.


• We can use custom initialization.


Exercises 
1. Use the FancyMLP model defined in Section 5.1 and access the parameters of the various layers.
2. Look at the initialization module document to explore different initializers.
3. Construct an MLP containing a shared parameter layer and train it. During the training
process, observe the model parameters and gradients of each layer.
4. Why is sharing parameters a good idea?


Discussions [79]


5.3 Deferred Initialization 


So far, it might seem that we got away with being sloppy in setting up our networks. Specifically, 
we did the following unintuitive things, which might not seem like they should work: 


• We defined the network architectures without specifying the input dimensionality.


• We added layers without specifying the output dimension of the previous layer.


• We even “initialized” these parameters before providing enough information to determine
how many parameters our models should contain.


You might be surprised that our code runs at all. After all, there is no way the deep learning 
framework could tell what the input dimensionality of a network would be. The trick here is that 
the framework defers initialization, waiting until the first time we pass data through the model, to 
infer the sizes of each layer on the fly. 


Later on, when working with convolutional neural networks, this technique will become even 
more convenient since the input dimensionality (i.e., the resolution of an image) will affect the 
dimensionality of each subsequent layer. Hence, the ability to set parameters without the need 
to know, at the time of writing the code, what the dimensionality is can greatly simplify the task 
of specifying and subsequently modifying our models. Next, we go deeper into the mechanics of 
initialization. 





[79] https://discuss.d2l.ai/t/56


5.3.1 Instantiating a Network


To begin, let us instantiate an MLP. 


from mxnet import np, npx 
from mxnet.gluon import nn 
npx.set_np() 


def get_net():
    net = nn.Sequential()
    net.add(nn.Dense(256, activation='relu'))
    net.add(nn.Dense(10))
    return net


net = get_net() 


At this point, the network cannot possibly know the dimensions of the input layer’s weights be- 
cause the input dimension remains unknown. Consequently the framework has not yet initialized 
any parameters. We confirm by attempting to access the parameters below. 


print(net.collect_params) 
print(net.collect_params()) 


<bound method Block.collect_params of Sequential(
  (0): Dense(-1 -> 256, Activation(relu))
  (1): Dense(-1 -> 10, linear)
)>
sequential0_ (
  Parameter dense0_weight (shape=(256, -1), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
  Parameter dense1_weight (shape=(10, -1), dtype=float32)
  Parameter dense1_bias (shape=(10,), dtype=float32)
)


Note that while the parameter objects exist, the input dimension to each layer is listed as -1. MXNet 
uses the special value -1 to indicate that the parameter dimension remains unknown. At this point, 
attempts to access net[0].weight.data() would trigger a runtime error stating that the network 
must be initialized before the parameters can be accessed. Now let us see what happens when we 
attempt to initialize parameters via the initialize function. 


net.initialize() 
net.collect_params() 


sequential0_ (
  Parameter dense0_weight (shape=(256, -1), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
  Parameter dense1_weight (shape=(10, -1), dtype=float32)
  Parameter dense1_bias (shape=(10,), dtype=float32)
)


As we can see, nothing has changed. When input dimensions are unknown, calls to initialize do
not truly initialize the parameters. Instead, this call registers with MXNet that we wish to
initialize the parameters (and, optionally, according to which distribution).


Next, let us pass data through the network to make the framework finally initialize parameters.


X = np.random.uniform(size=(2, 20)) 
net(X)


net.collect_params() 


sequential0_ ( 
Parameter dense0_weight (shape=(256, 20), dtype=float32) 
Parameter dense0_bias (shape=(256,), dtype=float32) 
Parameter dense1_weight (shape=(10, 256), dtype=float32)
Parameter dense1_bias (shape=(10,), dtype=float32)

) 


As soon as we know the input dimensionality, 20, the framework can identify the shape of the 
first layer’s weight matrix by plugging in the value of 20. Having recognized the first layer’s shape, 
the framework proceeds to the second layer, and so on through the computational graph until all 
shapes are known. Note that in this case, only the first layer requires deferred initialization, but 
the framework initializes sequentially. Once all parameter shapes are known, the framework can 
finally initialize the parameters. 
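The mechanism can be mimicked in a few lines of plain Python. The `LazyDense` class below is a hypothetical toy of our own, not how MXNet implements deferred initialization: it records only the number of output units at construction time and allocates its weight matrix the first time it sees an input, using the same -1 placeholder for the unknown input dimension.

```python
import random


class LazyDense:
    """Toy dense layer whose weight shape is inferred on the first call.

    Hypothetical illustration only; this is not MXNet's implementation.
    """

    def __init__(self, units):
        self.units = units
        self.in_units = -1  # -1 marks "not yet known"
        self.weight = None  # allocated lazily

    def __call__(self, x):
        if self.weight is None:
            # First forward pass: the input width is now known
            self.in_units = len(x)
            self.weight = [[random.uniform(-0.07, 0.07)
                            for _ in range(self.in_units)]
                           for _ in range(self.units)]
        # Matrix-vector product: one output per row of the weight matrix
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


layers = [LazyDense(256), LazyDense(10)]
print([(layer.units, layer.in_units) for layer in layers])  # shapes unknown

out = [0.5] * 20  # a single example with 20 features
for layer in layers:
    out = layer(out)
print([(layer.units, layer.in_units) for layer in layers])  # shapes resolved
```

Before any data is seen, both layers report an input dimension of -1; after one forward pass, the first layer has resolved it to 20 and the second to 256, mirroring the sequential shape inference described above.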


Summary 


• Deferred initialization can be convenient, allowing the framework to infer parameter shapes
automatically, making it easy to modify architectures and eliminating one common source
of errors.


• We can pass data through the model to make the framework finally initialize parameters.


Exercises 
1. What happens if you specify the input dimensions to the first layer but not to subsequent 
layers? Do you get immediate initialization? 
2. What happens if you specify mismatching dimensions? 


3. What would you need to do if you have input of varying dimensionality? Hint: look at the 
parameter tying. 


Discussions [80]





[80] https://discuss.d2l.ai/t/280


5.4 Custom Layers


One factor behind deep learning's success is the availability of a wide range of layers that can be 
composed in creative ways to design architectures suitable for a wide variety of tasks. For instance, 
researchers have invented layers specifically for handling images, text, looping over sequential 
data, and performing dynamic programming. Sooner or later, you will encounter or invent a layer 
that does not exist yet in the deep learning framework. In these cases, you must build a custom 
layer. In this section, we show you how. 


5.4.1 Layers without Parameters 


To start, we construct a custom layer that does not have any parameters of its own. This should
look familiar if you recall our introduction to blocks in Section 5.1. The following CenteredLayer
class simply subtracts the mean from its input. To build it, we only need to inherit from the
base layer class and implement the forward propagation function.


from mxnet import np, npx
from mxnet.gluon import nn

npx.set_np()


class CenteredLayer(nn.Block):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def forward(self, X):
        return X - X.mean()


Let us verify that our layer works as intended by feeding some data through it. 


layer = CenteredLayer() 
layer(np.array([1, 2, 3, 4, 5])) 


array([-2., -1., 0., 1., 2.])


We can now incorporate our layer as a component in constructing more complex models. 


net = nn.Sequential() 
net.add(nn.Dense(128), CenteredLayer()) 
net.initialize() 


As an extra sanity check, we can send random data through the network and check that the mean 
is in fact 0. Because we are dealing with floating point numbers, we may still see a very small 
nonzero number due to quantization. 


Y = net(np.random.uniform(size=(4, 8))) 
Y.mean() 


array(3.783498e-10) 
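The same effect can be reproduced without the framework. The following plain-Python sketch (the function name `center` is ours) subtracts the mean from a list of numbers. For small integers the result is exact, whereas for values that are not exactly representable in binary floating point, the mean of the centered data may come out as a tiny nonzero number.

```python
def center(values):
    """Subtract the mean from every element (plain-Python analogue of CenteredLayer)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


centered = center([1, 2, 3, 4, 5])
print(centered)  # [-2.0, -1.0, 0.0, 1.0, 2.0]

# With inputs such as 0.1 that have no exact binary representation, rounding
# errors can leave a very small residual instead of an exact zero.
noisy = center([0.1, 0.2, 0.3, 0.7])
residual = sum(noisy) / len(noisy)
print(residual)
```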


5.4.2 Layers with Parameters


Now that we know how to define simple layers, let us move on to defining layers with parameters 
that can be adjusted through training. We can use built-in functions to create parameters, which 
provide some basic housekeeping functionality. In particular, they govern access, initialization, 
sharing, saving, and loading model parameters. This way, among other benefits, we will not need 
to write custom serialization routines for every custom layer. 


Now let us implement our own version of the fully-connected layer. Recall that this layer requires
two parameters, one to represent the weight and the other for the bias. In this implementation,
we bake in the ReLU activation as a default. This layer requires two input arguments: in_units and
units, which denote the number of inputs and outputs, respectively.


class MyDense(nn.Block):
    def __init__(self, units, in_units, **kwargs):
        super().__init__(**kwargs)
        self.weight = self.params.get('weight', shape=(in_units, units))
        self.bias = self.params.get('bias', shape=(units,))

    def forward(self, x):
        linear = np.dot(x, self.weight.data(ctx=x.ctx)) + self.bias.data(
            ctx=x.ctx)
        return npx.relu(linear)


Next, we instantiate the MyDense class and access its model parameters. 


dense = MyDense(units=3, in_units=5) 
dense.params


mydense0_ ( 
Parameter mydense0_weight (shape=(5, 3), dtype=<class 'numpy.float32'>) 
Parameter mydense0_bias (shape=(3,), dtype=<class 'numpy.float32'>) 

) 


We can directly carry out forward propagation calculations using custom layers. 


dense.initialize() 
dense(np.random.uniform(size=(2, 5))) 


array([[0.        , 0.01633355, 0.        ],
       [0.        , 0.01581812, 0.        ]])


We can also construct models using custom layers. Once we have defined such a layer, we can use
it just like the built-in fully-connected layer.


net = nn.Sequential() 
net.add(MyDense(8, in_units=64), 
MyDense(1, in_units=8)) 
net.initialize() 
net(np.random.uniform(size=(2, 64)))


array([[0.06508517],
       [0.0615553 ]])


Summary 


• We can design custom layers via the basic layer class. This allows us to define flexible new
layers that behave differently from any existing layers in the library.


• Once defined, custom layers can be invoked in arbitrary contexts and architectures.


• Layers can have local parameters, which can be created through built-in functions.


Exercises 


1. Design a layer that takes an input and computes a tensor reduction, i.e., it returns
$y_k = \sum_{i, j} W_{ijk} x_i x_j$.
2. Design a layer that returns the leading half of the Fourier coefficients of the data. 


Discussions [81]


5.5 File I/O 


So far we discussed how to process data and how to build, train, and test deep learning models. 
However, at some point, we will hopefully be happy enough with the learned models that we will 
want to save the results for later use in various contexts (perhaps even to make predictions in
deployment). Additionally, when running a long training process, the best practice is to periodically
save intermediate results (checkpointing) to ensure that we do not lose several days' worth of
computation if we trip over the power cord of our server. Thus it is time to learn how to load and store
both individual weight vectors and entire models. This section addresses both issues. 
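As a rough illustration of checkpointing, the sketch below uses only Python's standard library (`pickle`, `os`, `tempfile`) rather than the framework's own functions, which are introduced next. The helper names `save_checkpoint` and `load_checkpoint`, the file name, and the "training" step are all placeholders of our own. The state is written to a temporary file and then renamed, so that an interruption during writing is unlikely to leave a corrupted checkpoint behind.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file first, then rename it over the target."""
    tmp_path = path + '.tmp'
    with open(tmp_path, 'wb') as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # replaces the old checkpoint in one step


def load_checkpoint(path):
    """Read back whatever state was last saved."""
    with open(path, 'rb') as f:
        return pickle.load(f)


ckpt_path = os.path.join(tempfile.mkdtemp(), 'train.ckpt')

state = {'epoch': 0, 'weights': [0.0, 0.0]}
for epoch in range(1, 6):
    # Stand-in for one epoch of real training
    state['weights'] = [w + 0.1 for w in state['weights']]
    state['epoch'] = epoch
    if epoch % 2 == 0:  # checkpoint every second epoch
        save_checkpoint(ckpt_path, state)

resumed = load_checkpoint(ckpt_path)
print(resumed['epoch'])  # 4: the last epoch that was checkpointed
```

If the job died during epoch 5, training could resume from the state saved after epoch 4 instead of starting over.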


5.5.1 Loading and Saving Tensors 


For individual tensors, we can directly invoke the load and save functions to read and write them 
respectively. Both functions require that we supply a name, and save requires as input the variable 
to be saved. 


from mxnet import np, npx 
from mxnet.gluon import nn 


npx.set_np() 


x = np.arange(4) 
npx.save('x-file', x) 


We can now read the data from the stored file back into memory. 





[81] https://discuss.d2l.ai/t/58


x2 = npx.load('x-file')
x2 


[array([0., 1., 2., 3.])]


We can store a list of tensors and read them back into memory. 


y = np.zeros(4) 
npx.save('x-files', [x, y]) 
x2, y2 = npx.load('x-files') 
(x2, y2) 


(array([0., 1., 2., 3.]), array([0., 0., 0., 0.]))


We can even write and read a dictionary that maps from strings to tensors. This is convenient 
when we want to read or write all the weights in a model. 


mydict = {'x': x, 'y': y}
npx.save('mydict', mydict) 
mydict2 = npx.load('mydict') 
mydict2 


{'x': array([0., 1., 2., 3.]), 'y': array([0., 0., 0., 0.])}


5.5.2 Loading and Saving Model Parameters 


Saving individual weight vectors (or other tensors) is useful, but it gets very tedious if we want 
to save (and later load) an entire model. After all, we might have hundreds of parameter groups 
sprinkled throughout. For this reason the deep learning framework provides built-in function- 
alities to load and save entire networks. An important detail to note is that this saves model pa- 
rameters and not the entire model. For example, if we have a 3-layer MLP, we need to specify the 
architecture separately. The reason for this is that the models themselves can contain arbitrary 
code, hence they cannot be serialized as naturally. Thus, in order to reinstate a model, we need 
to generate the architecture in code and then load the parameters from disk. Let us start with our 
familiar MLP. 


class MLP(nn.Block):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Dense(256, activation='relu')
        self.output = nn.Dense(10)

    def forward(self, x):
        return self.output(self.hidden(x))


net = MLP()
net.initialize()
X = np.random.uniform(size=(2, 20))
Y = net(X)


Next, we store the parameters of the model as a file with the name “mlp.params”.


net.save_parameters('mlp.params') 


To recover the model, we instantiate a clone of the original MLP model. Instead of randomly 
initializing the model parameters, we read the parameters stored in the file directly. 


clone = MLP() 
clone.load_parameters('mlp.params')


Since both instances have the same model parameters, the computational result of the same input 
X should be the same. Let us verify this. 


Y_clone = clone(X) 
Y_clone == Y


array([[ True, True, True, True, True, True, True, True, True,
True], 
[ True, True, True, True, True, True, True, True, True, 
True]]) 
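The division of labor between code (the architecture) and file (the parameters) can be sketched without any framework. In the toy below, which is our own illustration and only borrows the method names `save_parameters` and `load_parameters` for familiarity, the class definition fixes the structure of a tiny two-layer network, while only the numbers in its parameter dictionary are written to and read from disk (here as JSON via the standard library).

```python
import json
import os
import tempfile


class TinyMLP:
    """Architecture lives in code; only the numbers in `params` go to disk."""

    def __init__(self):
        self.params = {'hidden_weight': [[0.0, 0.0], [0.0, 0.0]],
                       'output_weight': [[0.0, 0.0]]}

    def forward(self, x):
        # Hidden layer with ReLU, followed by a linear output layer
        h = [max(0.0, sum(w * v for w, v in zip(row, x)))
             for row in self.params['hidden_weight']]
        return [sum(w * v for w, v in zip(row, h))
                for row in self.params['output_weight']]

    def save_parameters(self, path):
        with open(path, 'w') as f:
            json.dump(self.params, f)

    def load_parameters(self, path):
        with open(path) as f:
            loaded = json.load(f)
        if loaded.keys() != self.params.keys():
            raise ValueError('parameter names do not match this architecture')
        self.params = loaded


net = TinyMLP()
net.params['hidden_weight'] = [[1.0, -1.0], [0.5, 0.5]]
net.params['output_weight'] = [[2.0, 3.0]]

path = os.path.join(tempfile.mkdtemp(), 'tiny.params.json')
net.save_parameters(path)

clone = TinyMLP()            # rebuild the architecture in code
clone.load_parameters(path)  # then restore the saved numbers

x = [1.0, 2.0]
print(net.forward(x) == clone.forward(x))  # True
```

Note that the file alone would not be enough to reconstruct the model: without the `TinyMLP` class, nothing says how the stored numbers are to be combined.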


Summary 


• The save and load functions can be used to perform file I/O for tensor objects.


• We can save and load the entire sets of parameters for a network via a parameter dictionary.


• Saving the architecture has to be done in code rather than in parameters.


Exercises 
1. Even if there is no need to deploy trained models to a different device, what are the practical 
benefits of storing model parameters? 


2. Assume that we want to reuse only parts of a network to be incorporated into a network 
of a different architecture. How would you go about using, say the first two layers from a 
previous network in a new network? 


3. How would you go about saving the network architecture and parameters? What restrictions 
would you impose on the architecture? 


Discussions [82]


5.6 GPUs


In Table 1.5.1, we discussed the rapid growth of computation over the past two decades. In a 
nutshell, GPU performance has increased by a factor of 1000 every decade since 2000. This offers 
great opportunities but it also suggests a significant need to provide such performance. 


In this section, we begin to discuss how to harness this computational performance for your research.
We start with a single GPU and, at a later point, cover how to use multiple GPUs and multiple
servers (with multiple GPUs).


Specifically, we will discuss how to use a single NVIDIA GPU for calculations. First, make sure
you have at least one NVIDIA GPU installed. Then, download the NVIDIA driver and CUDA [83] and
follow the prompts to set the appropriate path. Once these preparations are complete, the nvidia-smi
command can be used to view the graphics card information.


!nvidia-smi


Mon Jan 18 04:51:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   46C    P0    52W / 300W |   2911MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   43C    P0    38W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   53C    P0    53W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    52W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     49228      C   ...                                         1269MiB |
|    0     50627      C   .../envs/gluon-cv-py3-auto_test/bin/python  1631MiB |
+-----------------------------------------------------------------------------+


You might have noticed that an MXNet tensor looks almost identical to a NumPy ndarray. But there
are a few crucial differences. One of the key features that distinguishes MXNet from NumPy is its
support for diverse hardware devices.


In MXNet, every array has a context. So far, by default, all variables and associated computation
have been assigned to the CPU. Typically, other contexts might be various GPUs. Things can get
even hairier when we deploy jobs across multiple servers. By assigning arrays to contexts intelligently,
we can minimize the time spent transferring data between devices. For example, when
training neural networks on a server with a GPU, we typically prefer for the model's parameters
to live on the GPU.


[83] https://developer.nvidia.com/cuda-downloads


Next, we need to confirm that the GPU version of MXNet is installed. If a CPU version of MXNet
is already installed, we need to uninstall it first. For example, use the pip uninstall mxnet command,
then install the corresponding MXNet version according to your CUDA version. Assuming
you have CUDA 10.0 installed, you can install the MXNet version that supports CUDA 10.0 via pip
install mxnet-cu100.


To run the programs in this section, you need at least two GPUs. Note that this might be extravagant 
for most desktop computers but it is easily available in the cloud, e.g., by using the AWS EC2 multi- 
GPU instances. Almost all other sections do not require multiple GPUs. Instead, this is simply to 
illustrate how data flow between different devices. 


5.6.1 Computing Devices 


We can specify devices, such as CPUs and GPUs, for storage and calculation. By default, tensors
are created in the main memory and the CPU is used to perform calculations on them.


In MXNet, the CPU and GPU can be indicated by cpu() and gpu(). It should be noted that cpu() (or
cpu with any integer in the parentheses) means all physical CPUs and memory. This means that MXNet’s
calculations will try to use all CPU cores. However, gpu() only represents one card and the corresponding
memory. If there are multiple GPUs, we use gpu(i) to represent the i-th GPU (i starts
from 0). Also, gpu(0) and gpu() are equivalent.


from mxnet import np, npx 
from mxnet.gluon import nn 


npx.set_np() 


npx.cpu(), npx.gpu(), npx.gpu(1) 


(cpu(0), gpu(0), gpu(1))


We can query the number of available GPUs. 


npx.num_gpus()


Now we define two convenient functions that allow us to run code even if the requested GPUs do 
not exist. 


def try_gpu(i=0):  #@save
    """Return gpu(i) if exists, otherwise return cpu()."""
    return npx.gpu(i) if npx.num_gpus() >= i + 1 else npx.cpu()


def try_all_gpus():  #@save
    """Return all available GPUs, or [cpu()] if no GPU exists."""
    devices = [npx.gpu(i) for i in range(npx.num_gpus())]
    return devices if devices else [npx.cpu()]


try_gpu(), try_gpu(10), try_all_gpus()


(gpu(0), cpu(0), [gpu(0), gpu(1)])
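The fallback logic in these two functions does not depend on MXNet itself. The sketch below restates it in plain Python so that it can be tried on a machine without any GPU: the number of GPUs is passed in explicitly, and devices are represented by simple tuples. The names `pick_device` and `pick_all_devices` are hypothetical stand-ins for `try_gpu` and `try_all_gpus`.

```python
def pick_device(i, num_gpus):
    """Return ('gpu', i) if GPU i exists, otherwise ('cpu', 0).

    `num_gpus` is passed in explicitly so the logic runs on any machine.
    """
    return ('gpu', i) if num_gpus >= i + 1 else ('cpu', 0)


def pick_all_devices(num_gpus):
    """Return every available GPU, or a one-element CPU list if there is none."""
    devices = [('gpu', i) for i in range(num_gpus)]
    return devices if devices else [('cpu', 0)]


# Behavior with zero, one, and two GPUs
for n in (0, 1, 2):
    print(n, pick_device(0, n), pick_device(1, n), pick_all_devices(n))
```

With two GPUs, requesting device 10 falls back to the CPU, matching the `try_gpu(10)` result shown above.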


5.6.2 Tensors and GPUs 


By default, tensors are created on the CPU. We can query the device where the tensor is located. 


x = np.array([1, 2, 3]) 
x.ctx 


cpu(0) 


It is important to note that whenever we want to operate on multiple terms, they need to be on the 
same device. For instance, if we sum two tensors, we need to make sure that both arguments live 
on the same device—otherwise the framework would not know where to store the result or even 
how to decide where to perform the computation. 


Storage on the GPU 


There are several ways to store a tensor on the GPU. For example, we can specify a storage device
when creating a tensor. Next, we create the tensor variable X on the first GPU. The tensor created
on a GPU only consumes the memory of this GPU. We can use the nvidia-smi command to view
GPU memory usage. In general, we need to make sure that we do not create data that exceed the
GPU memory limit.


X = np.ones((2, 3), ctx=try_gpu()) 
X 


array([[1., 1., 1.],
       [1., 1., 1.]], ctx=gpu(0))


Assuming that you have at least two GPUs, the following code will create a random tensor on the
second GPU.


Y = np.random.uniform(size=(2, 3), ctx=try_gpu(1)) 
Y 


array([[0.67478997, 0.07540122, 0.9956977 ], 
       [0.09488854, 0.415456 , 0.11231736]], ctx=gpu(1))


Copying


If we want to compute X + Y, we need to decide where to perform this operation. For instance, 
as shown in Fig. 5.6.1, we can transfer X to the second GPU and perform the operation there. Do 
not simply add X and Y, since this will result in an exception. The runtime engine would not know 
what to do: it cannot find data on the same device and it fails. Since Y lives on the second GPU, we 
need to move X there before we can add the two. 






Fig. 5.6.1: Copy data to perform an operation on the same device.


Z = X.copyto(try_gpu(1))
print(X)
print(Z)


[[1. 1. 1.]
 [1. 1. 1.]] @gpu(0)
[[1. 1. 1.]
 [1. 1. 1.]] @gpu(1)


Now that the data are on the same GPU (both Z and Y are), we can add them up. 


Y + Z


array([[1.6747899, 1.0754012, 1.9956977], 
[1.0948886, 1.415456 , 1.1123173]], ctx=gpu(1)) 


Imagine that your variable Z already lives on your second GPU. What happens if we still call
Z.copyto(gpu(1))? It will make a copy and allocate new memory, even though that variable already
lives on the desired device. There are times when, depending on the environment our code is
running in, two variables may already live on the same device. So we want to make a copy only
if the variables currently live on different devices. In these cases, we can call as_in_ctx. If the
variable already lives on the specified device, then this is a no-op. Unless you specifically want to
make a copy, as_in_ctx is the method of choice.


Z.as_in_ctx(try_gpu(1)) is Z 


True 
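The difference between the two methods can be captured in a small toy class. `ToyTensor` below is a hypothetical stand-in of our own that merely remembers a device label; it is meant to illustrate the semantics described above, not MXNet's actual implementation.

```python
class ToyTensor:
    """Minimal stand-in for a tensor that remembers which device holds it."""

    def __init__(self, values, device):
        self.values = list(values)
        self.device = device

    def copyto(self, device):
        # Always allocates a new object, even if the device is unchanged
        return ToyTensor(self.values, device)

    def as_in_ctx(self, device):
        # No-op when the tensor already lives on the requested device
        return self if self.device == device else self.copyto(device)


z = ToyTensor([1.0, 1.0, 1.0], device='gpu(1)')
print(z.copyto('gpu(1)') is z)     # False: a fresh copy was made
print(z.as_in_ctx('gpu(1)') is z)  # True: nothing to do
print(z.as_in_ctx('gpu(0)') is z)  # False: a move requires a copy
```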


Side Notes


People use GPUs to do machine learning because they expect them to be fast. But transferring 
variables between devices is slow. So we want you to be 100% certain that you want to do some- 
thing slow before we let you do it. If the deep learning framework just did the copy automatically 
without crashing then you might not realize that you had written some slow code. 


Also, transferring data between devices (CPU, GPUs, and other machines) is
much slower than computation. It also makes parallelization a lot more difficult, since we have to
wait for data to be sent (or rather to be received) before we can proceed with more operations. This
is why copy operations should be taken with great care. As a rule of thumb, many small operations
are much worse than one big operation. Moreover, several operations at a time are much better
than many single operations interspersed in the code unless you know what you are doing. This
is the case since such operations can block if one device has to wait for the other before it can do
something else. It is a bit like ordering your coffee in a queue rather than pre-ordering it by phone
and finding out that it is ready when you are.
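A back-of-the-envelope model makes the rule of thumb concrete. Suppose, purely as an assumption for illustration, that each transfer costs a fixed latency plus the payload size divided by the bandwidth. The latency and bandwidth figures below are made up and are not measurements of any real device; only the comparison between the two strategies matters.

```python
def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Time for one transfer under a simple latency + size/bandwidth model."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


# Assumed, illustrative numbers (not measurements of any real device)
LATENCY = 10e-6   # 10 microseconds of fixed overhead per transfer
BANDWIDTH = 10e9  # 10 GB/s

total_bytes = 4 * 1000 * 1000  # e.g., one million float32 values

# Same amount of data, moved in one piece or in 1000 small pieces
one_big = transfer_time(total_bytes, LATENCY, BANDWIDTH)
many_small = 1000 * transfer_time(total_bytes / 1000, LATENCY, BANDWIDTH)

print(f'one transfer:   {one_big * 1e3:.3f} ms')
print(f'1000 transfers: {many_small * 1e3:.3f} ms')
```

Under these assumptions, the single transfer takes about 0.41 ms while the thousand small ones take about 10.4 ms, because the fixed per-transfer overhead is paid a thousand times.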


Last, when we print tensors or convert tensors to the NumPy format, if the data is not in the main 
memory, the framework will copy it to the main memory first, resulting in additional transmis- 
sion overhead. Even worse, it is now subject to the dreaded global interpreter lock that makes 
everything wait for Python to complete. 


5.6.3 Neural Networks and GPUs 


Similarly, a neural network model can specify devices. The following code puts the model param- 
eters on the GPU. 


net = nn.Sequential() 


net.add(nn.Dense(1))
net.initialize(ctx=try_gpu()) 


We will see many more examples of how to run models on GPUs in the following chapters, simply 
since they will become somewhat more computationally intensive. 


When the input is a tensor on the GPU, the model will calculate the result on the same GPU. 


net(X)


array([[0.04995865], 
[0.04995865]], ctx=gpu(0)) 


Let us confirm that the model parameters are stored on the same GPU.


net[0].weight.data().ctx


gpu(0)


In short, as long as all data and parameters are on the same device, we can learn models efficiently.
In the following chapters we will see several such examples.


Summary


• We can specify devices for storage and calculation, such as the CPU or GPU. By default, data
are created in the main memory and the CPU is used for calculations.


• The deep learning framework requires all input data for calculation to be on the same device,
be it CPU or the same GPU.


• You can lose significant performance by moving data without care. A typical mistake is as
follows: computing the loss for every minibatch on the GPU and reporting it back to the user
on the command line (or logging it in a NumPy ndarray) will trigger a global interpreter lock
which stalls all GPUs. It is much better to allocate memory for logging inside the GPU and
only move larger logs.


Exercises 


1. Try a larger computation task, such as the multiplication of large matrices, and see the dif- 
ference in speed between the CPU and GPU. What about a task with a small amount of cal- 
culations? 


2. How should we read and write model parameters on the GPU? 


3. Measure the time it takes to compute 1000 matrix-matrix multiplications of 100 x 100 matri- 
ces and log the Frobenius norm of the output matrix one result at a time vs. keeping a log on 
the GPU and transferring only the final result. 


4. Measure how much time it takes to perform two matrix-matrix multiplications on two GPUs 
at the same time vs. in sequence on one GPU. Hint: you should see almost linear scaling. 


Discussions [84]


6 Convolutional Neural Networks


In earlier chapters, we came up against image data, for which each example consists of a two- 
dimensional grid of pixels. Depending on whether we are handling black-and-white or color im- 
ages, each pixel location might be associated with either one or multiple numerical values, respec- 
tively. Until now, our way of dealing with this rich structure was deeply unsatisfying. We simply 
discarded each image's spatial structure by flattening them into one-dimensional vectors, feeding 
them through a fully-connected MLP. Because these networks are invariant to the order of the features,
we could get similar results regardless of whether we preserve an order corresponding to
the spatial structure of the pixels or if we permute the columns of our design matrix before fitting
the MLP's parameters. Preferably, we would leverage our prior knowledge that nearby pixels are
typically related to each other, to build efficient models for learning from image data. 
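The claim about feature order can be checked on a single fully-connected layer. The sketch below is a plain-Python illustration of our own with made-up sizes: it applies the same permutation to the input features and to the columns of the weight matrix and confirms that the layer's output is unchanged. Since any weight matrix learned on the original ordering has such a permuted counterpart, an MLP has no reason to prefer one ordering of the pixels over another.

```python
import random


def dense(weights, x):
    """Fully connected layer without bias: one dot product per output unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


rng = random.Random(0)
num_features, num_outputs = 6, 3
x = [rng.random() for _ in range(num_features)]
W = [[rng.random() for _ in range(num_features)] for _ in range(num_outputs)]

# Apply the same fixed permutation to the features and to the weight columns
perm = list(range(num_features))
rng.shuffle(perm)
x_perm = [x[j] for j in perm]
W_perm = [[row[j] for j in perm] for row in W]

y = dense(W, x)
y_perm = dense(W_perm, x_perm)
print(max(abs(a - b) for a, b in zip(y, y_perm)))  # zero up to rounding
```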


This chapter introduces convolutional neural networks (CNNs), a powerful family of neural networks 
that are designed for precisely this purpose. CNN-based architectures are now ubiquitous in the 
field of computer vision, and have become so dominant that hardly anyone today would develop a 
commercial application or enter a competition related to image recognition, object detection, or 
semantic segmentation, without building off of this approach. 


Modern CNNs, as they are called colloquially, owe their design to inspirations from biology, group
theory, and a healthy dose of experimental tinkering. In addition to their sample efficiency in 
achieving accurate models, CNNs tend to be computationally efficient, both because they require 
fewer parameters than fully-connected architectures and because convolutions are easy to par- 
allelize across GPU cores. Consequently, practitioners often apply CNNs whenever possible, and 
increasingly they have emerged as credible competitors even on tasks with a one-dimensional se- 
quence structure, such as audio, text, and time series analysis, where recurrent neural networks 
are conventionally used. Some clever adaptations of CNNs have also brought them to bear on 
graph-structured data and in recommender systems. 


First, we will walk through the basic operations that comprise the backbone of all convolu- 
tional networks. These include the convolutional layers themselves, nitty-gritty details includ- 
ing padding and stride, the pooling layers used to aggregate information across adjacent spatial 
regions, the use of multiple channels at each layer, and a careful discussion of the structure of 
modern architectures. We will conclude the chapter with a full working example of LeNet, the 
first convolutional network successfully deployed, long before the rise of modern deep learning. 
In the next chapter, we will dive into full implementations of some popular and comparatively 
recent CNN architectures whose designs represent most of the techniques commonly used by 
modern practitioners.


6.1 From Fully-Connected Layers to Convolutions


To this day, the models that we have discussed so far remain appropriate options when we are 
dealing with tabular data. By tabular, we mean that the data consist of rows corresponding to 
examples and columns corresponding to features. With tabular data, we might anticipate that 
the patterns we seek could involve interactions among the features, but we do not assume any 
structure a priori concerning how the features interact. 


Sometimes, we truly lack knowledge to guide the construction of craftier architectures. In these 
cases, an MLP may be the best that we can do. However, for high-dimensional perceptual data, 
such structure-less networks can grow unwieldy. 


For instance, let us return to our running example of distinguishing cats from dogs. Say that we do 
a thorough job in data collection, collecting an annotated dataset of one-megapixel photographs. 
This means that each input to the network has one million dimensions. According to our discus- 
sions of parameterization cost of fully-connected layers in Section 3.4.3, even an aggressive re- 
duction to one thousand hidden dimensions would require a fully-connected layer characterized 
by $10^6 \times 10^3 = 10^9$ parameters. Unless we have lots of GPUs, a talent for distributed optimization,
and an extraordinary amount of patience, learning the parameters of this network may turn out 
to be infeasible. 
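
The arithmetic above is easy to check. The following is a minimal sketch in plain Python; the variable names are ours, and the memory figure assumes 32-bit floating point weights, which the text does not specify.

```python
# Weights in one fully-connected layer mapping a flattened image to hidden units.
num_inputs = 10**6    # one-megapixel image, one value per pixel
num_hidden = 10**3    # "aggressive" reduction to one thousand hidden units

fc_weights = num_inputs * num_hidden

# Assumption: 32-bit floats, i.e., 4 bytes per weight.
fc_gigabytes = fc_weights * 4 / 1e9

print(f'{fc_weights:.0e} weights, about {fc_gigabytes:.0f} GB in float32')
```

Even before counting gradients and optimizer state, a single such layer would occupy several gigabytes under this assumption.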


A careful reader might object to this argument on the basis that one megapixel resolution may not 
be necessary. However, while we might be able to get away with one hundred thousand pixels, 
our hidden layer of size 1000 grossly underestimates the number of hidden units that it takes to 
learn good representations of images, so a practical system will still require billions of parame- 
ters. Moreover, learning a classifier by fitting so many parameters might require collecting an 
enormous dataset. And yet today both humans and computers are able to distinguish cats from 
dogs quite well, seemingly contradicting these intuitions. That is because images exhibit rich 
structure that can be exploited by humans and machine learning models alike. Convolutional 
neural networks (CNNs) are one creative way that machine learning has embraced for exploiting 
some of the known structure in natural images. 


6.1.1 Invariance 


Imagine that you want to detect an object in an image. It seems reasonable that whatever method 
we use to recognize objects should not be overly concerned with the precise location of the ob- 
ject in the image. Ideally, our system should exploit this knowledge. Pigs usually do not fly and 
planes usually do not swim. Nonetheless, we should still recognize a pig were one to appear at the 
top of the image. We can draw some inspiration here from the children’s game “Where’s Waldo” 
(depicted in Fig. 6.1.1). The game consists of a number of chaotic scenes bursting with activities. 
Waldo shows up somewhere in each, typically lurking in some unlikely location. The reader’s goal 
is to locate him. Despite his characteristic outfit, this can be surprisingly difficult, due to the large 
number of distractions. However, what Waldo looks like does not depend upon where Waldo is lo- 
cated. We could sweep the image with a Waldo detector that could assign a score to each patch, 
indicating the likelihood that the patch contains Waldo. CNNs systematize this idea of spatial
invariance, exploiting it to learn useful representations with fewer parameters.


Fig. 6.1.1: An image of the “Where's Waldo” game.


We can now make these intuitions more concrete by enumerating a few desiderata to guide our 
design of a neural network architecture suitable for computer vision: 


1. In the earliest layers, our network should respond similarly to the same patch, regardless of 
where it appears in the image. This principle is called translation invariance. 


2. The earliest layers of the network should focus on local regions, without regard for the con- 
tents of the image in distant regions. This is the locality principle. Eventually, these local 
representations can be aggregated to make predictions at the whole image level. 


Let us see how this translates into mathematics. 


6.1.2 Constraining the MLP 


To start off, we can consider an MLP with two-dimensional images X as inputs and their imme- 
diate hidden representations H similarly represented as matrices in mathematics and as two- 
dimensional tensors in code, where both X and H have the same shape. Let that sink in. We now 
conceive of not only the inputs but also the hidden representations as possessing spatial structure. 


Let $[\mathbf{X}]_{i,j}$ and $[\mathbf{H}]_{i,j}$ denote the pixel at location $(i, j)$ in the input image and hidden representation,
respectively. Consequently, to have each of the hidden units receive input from each of the input
pixels, we would switch from using weight matrices (as we did previously in MLPs) to representing
our parameters as fourth-order weight tensors $\mathsf{W}$. Suppose that $\mathbf{U}$ contains biases; we could
formally express the fully-connected layer as


$$\begin{aligned}
[\mathbf{H}]_{i,j} &= [\mathbf{U}]_{i,j} + \sum_k \sum_l [\mathsf{W}]_{i,j,k,l} [\mathbf{X}]_{k,l} \\
&= [\mathbf{U}]_{i,j} + \sum_a \sum_b [\mathsf{V}]_{i,j,a,b} [\mathbf{X}]_{i+a,j+b},
\end{aligned}$$ (6.1.1)


where the switch from $\mathsf{W}$ to $\mathsf{V}$ is entirely cosmetic for now since there is a one-to-one
correspondence between coefficients in both fourth-order tensors. We simply re-index the subscripts $(k, l)$
such that $k = i + a$ and $l = j + b$. In other words, we set $[\mathsf{V}]_{i,j,a,b} = [\mathsf{W}]_{i,j,i+a,j+b}$. The indices $a$ and
$b$ run over both positive and negative offsets, covering the entire image. For any given location
$(i, j)$ in the hidden representation $[\mathbf{H}]_{i,j}$, we compute its value by summing over pixels in $\mathbf{X}$, centered
around $(i, j)$ and weighted by $[\mathsf{V}]_{i,j,a,b}$.


Translation Invariance 


Now let us invoke the first principle established above: translation invariance. This implies that 
a shift in the input X should simply lead to a shift in the hidden representation H. This is only 
possible if $\mathsf{V}$ and $\mathbf{U}$ do not actually depend on $(i, j)$, i.e., we have $[\mathsf{V}]_{i,j,a,b} = [\mathbf{V}]_{a,b}$ and $\mathbf{U}$ is a constant,
say $u$. As a result, we can simplify the definition for $\mathbf{H}$:


$$[\mathbf{H}]_{i,j} = u + \sum_a \sum_b [\mathbf{V}]_{a,b} [\mathbf{X}]_{i+a,j+b}.$$ (6.1.2)


This is a convolution! We are effectively weighting pixels at $(i + a, j + b)$ in the vicinity of location
$(i, j)$ with coefficients $[\mathbf{V}]_{a,b}$ to obtain the value $[\mathbf{H}]_{i,j}$. Note that $[\mathbf{V}]_{a,b}$ needs many fewer coefficients
than $[\mathsf{V}]_{i,j,a,b}$ since it no longer depends on the location within the image. We have made significant
progress! 
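
To see what sharing the weights across locations buys us, the following sketch checks numerically that shifting the input shifts the hidden representation of (6.1.2). It is only an illustration under our own assumptions: it uses NumPy rather than the framework used elsewhere in this book, the helper `corr2d` is a hypothetical name for a plain windowed sum, the offsets $a, b$ run over 0, 1, 2 rather than being centered, and the values of `V` and `u` are arbitrary.

```python
import numpy as np

def corr2d(X, K):
    """Sum of elementwise products of K with each window of X, as in (6.1.2)."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 8))
V = rng.normal(size=(3, 3))    # shared weights [V]_{a,b}
u = 0.5                        # shared scalar bias

H = u + corr2d(X, V)
# Shift the input one pixel to the right (np.roll wraps around at the border).
H_shifted = u + corr2d(np.roll(X, 1, axis=1), V)

# Away from the wrapped column, the hidden representation shifts the same way.
shift_ok = np.allclose(H_shifted[:, 1:], H[:, :-1])
print(shift_ok)
```

The comparison excludes the first output column, since that is where the wrapped-around pixels enter the window.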


Locality 


Now let us invoke the second principle: locality. As motivated above, we believe that we should 
not have to look very far away from location (i, j) in order to glean relevant information to assess 
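what is going on at $[\mathbf{H}]_{i,j}$. This means that outside some range $|a| > \Delta$ or $|b| > \Delta$, we should set
$[\mathbf{V}]_{a,b} = 0$. Equivalently, we can rewrite $[\mathbf{H}]_{i,j}$ as


$$[\mathbf{H}]_{i,j} = u + \sum_{a=-\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} [\mathbf{V}]_{a,b} [\mathbf{X}]_{i+a,j+b}.$$ (6.1.3)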


Note that (6.1.3), in a nutshell, is a convolutional layer. Convolutional neural networks (CNNs) are 
a special family of neural networks that contain convolutional layers. In the deep learning re- 
search community, V is referred to as a convolution kernel, a filter, or simply the layer’s weights that 
are often learnable parameters. When the local region is small, the difference as compared with 
a fully-connected network can be dramatic. While previously, we might have required billions 
of parameters to represent just a single layer in an image-processing network, we now typically 
need just a few hundred, without altering the dimensionality of either the inputs or the hidden 
representations. The price paid for this drastic reduction in parameters is that our features are 
now translation invariant and that our layer can only incorporate local information, when de- 
termining the value of each hidden activation. All learning depends on imposing inductive bias. 
When that bias agrees with reality, we get sample-efficient models that generalize well to unseen 
data. But of course, if those biases do not agree with reality, e.g., if images turned out not to be 
translation invariant, our models might struggle even to fit our training data.


6.1.3 Convolutions


Before going further, we should briefly review why the above operation is called a convolution. In 
mathematics, the convolution between two functions, say $f, g: \mathbb{R}^d \to \mathbb{R}$, is defined as


$$(f * g)(\mathbf{x}) = \int f(\mathbf{z}) g(\mathbf{x} - \mathbf{z}) \, d\mathbf{z}.$$ (6.1.4)


That is, we measure the overlap between f and g when one function is “flipped” and shifted by x. 
Whenever we have discrete objects, the integral turns into a sum. For instance, for vectors from 
the set of square summable infinite dimensional vectors with index running over Z we obtain the 
following definition: 


$$(f * g)(i) = \sum_a f(a) g(i - a).$$ (6.1.5)


For two-dimensional tensors, we have a corresponding sum with indices $(a, b)$ for $f$ and $(i - a, j - b)$
for $g$, respectively:


$$(f * g)(i, j) = \sum_a \sum_b f(a, b) g(i - a, j - b).$$ (6.1.6)


This looks similar to (6.1.3), with one major difference. Rather than using $(i + a, j + b)$, we are using
the difference instead. Note, though, that this distinction is mostly cosmetic since we can always 
match the notation between (6.1.3) and (6.1.6). Our original definition in (6.1.3) more properly 
describes a cross-correlation. We will come back to this in the following section. 
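
As a small numerical illustration of the discrete definition (6.1.5), the sketch below evaluates the sum directly for two short sequences and compares it with NumPy's built-in routines. The function name `conv1d` and the sample values are ours; the sequences are treated as zero outside their stored entries, which is an assumption we make to keep the sums finite.

```python
import numpy as np

def conv1d(f, g):
    """Discrete convolution (6.1.5) for finite sequences, zero outside their support."""
    out = np.zeros(len(f) + len(g) - 1)
    for i in range(len(out)):
        for a in range(len(f)):
            if 0 <= i - a < len(g):
                out[i] += f[a] * g[i - a]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

direct = conv1d(f, g)
matches_numpy = np.allclose(direct, np.convolve(f, g))

# Cross-correlation pairs indices by a sum rather than a difference. For finite
# real sequences, np.correlate agrees with convolving against the reversed kernel.
xcorr = np.correlate(f, g, mode='full')
matches_flip = np.allclose(xcorr, np.convolve(f, g[::-1]))

print(direct, matches_numpy, matches_flip)
```

The second check is the one-dimensional version of the remark above: the two operations differ only in the direction in which one argument is traversed.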


6.1.4 “Where’s Waldo” Revisited 


Returning to our Waldo detector, let us see what this looks like. The convolutional layer picks 
windows of a given size and weighs intensities according to the filter V, as demonstrated in Fig. 
6.1.2. We might aim to learn a model so that wherever the “waldoness” is highest, we should find 
a peak in the hidden layer representations. 


Fig. 6.1.2: Detect Waldo.


Channels


There is just one problem with this approach. So far, we blissfully ignored that images consist 
of 3 channels: red, green, and blue. In reality, images are not two-dimensional objects but rather 
third-order tensors, characterized by a height, width, and channel, e.g., with shape $1024 \times 1024 \times 3$
pixels. While the first two of these axes concern spatial relationships, the third can be regarded
as assigning a multidimensional representation to each pixel location. We thus index $\mathsf{X}$ as $[\mathsf{X}]_{i,j,k}$.
The convolutional filter has to adapt accordingly. Instead of $[\mathbf{V}]_{a,b}$, we now have $[\mathsf{V}]_{a,b,c}$.


Moreover, just as our input consists of a third-order tensor, it turns out to be a good idea to similarly
formulate our hidden representations as third-order tensors H. In other words, rather than just 
having a single hidden representation corresponding to each spatial location, we want an entire 
vector of hidden representations corresponding to each spatial location. We could think of the 
hidden representations as comprising a number of two-dimensional grids stacked on top of each 
other. As in the inputs, these are sometimes called channels. They are also sometimes called feature
maps, as each provides a spatialized set of learned features to the subsequent layer. Intuitively, 
you might imagine that at lower layers that are closer to inputs, some channels could become 
specialized to recognize edges while others could recognize textures. 


To support multiple channels in both inputs (X) and hidden representations (H), we can add a 
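fourth coordinate to $\mathsf{V}$: $[\mathsf{V}]_{a,b,c,d}$. Putting everything together we have:


$$[\mathsf{H}]_{i,j,d} = \sum_{a=-\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} \sum_c [\mathsf{V}]_{a,b,c,d} [\mathsf{X}]_{i+a,j+b,c},$$ (6.1.7)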


where d indexes the output channels in the hidden representations H. The subsequent convolu- 
tional layer will go on to take a third-order tensor, H, as the input. Being more general, (6.1.7) is 
the definition of a convolutional layer for multiple channels, where V is a kernel or filter of the 
layer. 
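
The multi-channel sum in (6.1.7) can be written down almost literally. The sketch below is an illustration under our own assumptions rather than an efficient implementation: it uses NumPy, the function name `conv_layer` is hypothetical, channels are stored last to match the height, width, channel ordering used in this section, and the output is computed only where the whole window fits inside the input.

```python
import numpy as np

def conv_layer(X, V):
    """Evaluate (6.1.7) at positions where the whole window fits inside X.

    X has shape (height, width, c_in). V has shape (2*D+1, 2*D+1, c_in, c_out),
    where D plays the role of Delta and offset a is stored at index a + D.
    Output element (i, j) is centered on input location (i + D, j + D).
    """
    k = V.shape[0]
    out_h, out_w = X.shape[0] - k + 1, X.shape[1] - k + 1
    H = np.zeros((out_h, out_w, V.shape[3]))
    for i in range(out_h):
        for j in range(out_w):
            window = X[i:i + k, j:j + k, :]    # shape (k, k, c_in)
            # Sum over offsets a, b and input channels c for each output channel d.
            H[i, j, :] = np.einsum('abc,abcd->d', window, V)
    return H

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 6, 3))       # e.g., a tiny RGB image
V = rng.normal(size=(3, 3, 3, 4))    # Delta = 1, 3 input and 4 output channels
H = conv_layer(X, V)
print(H.shape)
```

Each output channel `d` has its own stack of `c_in` two-dimensional kernels, and the contributions of all input channels are added together at every location.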


There are still many operations that we need to address. For instance, we need to figure out how to 
combine all the hidden representations to a single output, e.g., whether there is a Waldo anywhere 
in the image. We also need to decide how to compute things efficiently, how to combine multi- 
ple layers, appropriate activation functions, and how to make reasonable design choices to yield 
networks that are effective in practice. We turn to these issues in the remainder of the chapter. 


Summary 


• Translation invariance in images implies that all patches of an image will be treated in the
  same manner.

• Locality means that only a small neighborhood of pixels will be used to compute the
  corresponding hidden representations.

• In image processing, convolutional layers typically require many fewer parameters than
  fully-connected layers.

• CNNs are a special family of neural networks that contain convolutional layers.

• Channels on input and output allow our model to capture multiple aspects of an image at
  each spatial location.


Exercises


1. Assume that the size of the convolution kernel is $\Delta = 0$. Show that in this case the
convolution kernel implements an MLP independently for each set of channels.


2. Why might translation invariance not be a good idea after all? 


3. What problems must we deal with when deciding how to treat hidden representations cor- 
responding to pixel locations at the boundary of an image? 


4. Describe an analogous convolutional layer for audio. 


5. Do you think that convolutional layers might also be applicable for text data? Why or why 
not? 


6. Prove that $f * g = g * f$.


Discussions


6.2 Convolutions for Images 


Now that we understand how convolutional layers work in theory, we are ready to see how they 
work in practice. Building on our motivation of convolutional neural networks as efficient archi- 
tectures for exploring structure in image data, we stick with images as our running example. 


6.2.1 The Cross-Correlation Operation 


Recall that strictly speaking, convolutional layers are a misnomer, since the operations they ex- 
press are more accurately described as cross-correlations. Based on our descriptions of convolu- 
tional layers in Section 6.1, in such a layer, an input tensor and a kernel tensor are combined to 
produce an output tensor through a cross-correlation operation. 


Let us ignore channels for now and see how this works with two-dimensional data and hidden 
representations. In Fig. 6.2.1, the input is a two-dimensional tensor with a height of 3 and width 
of 3. We mark the shape of the tensor as 3 x 3 or (3, 3). The height and width of the kernel are 
both 2. The shape of the kernel window (or convolution window) is given by the height and width of 
the kernel (here it is 2 x 2).


Fig. 6.2.1: Two-dimensional cross-correlation operation. The shaded portions are the first output
element as well as the input and kernel tensor elements used for the output computation:
$0 \times 0 + 1 \times 1 + 3 \times 2 + 4 \times 3 = 19$.


In the two-dimensional cross-correlation operation, we begin with the convolution window
positioned at the top-left corner of the input tensor and slide it across the input tensor, both from left
to right and top to bottom. When the convolution window slides to a certain position, the input
subtensor contained in that window and the kernel tensor are multiplied elementwise and the
resulting tensor is summed up, yielding a single scalar value. This result gives the value of the output
tensor at the corresponding location. Here, the output tensor has a height of 2 and width of 2 and
the four elements are derived from the two-dimensional cross-correlation operation:


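$$\begin{aligned}
0 \times 0 + 1 \times 1 + 3 \times 2 + 4 \times 3 &= 19, \\
1 \times 0 + 2 \times 1 + 4 \times 2 + 5 \times 3 &= 25, \\
3 \times 0 + 4 \times 1 + 6 \times 2 + 7 \times 3 &= 37, \\
4 \times 0 + 5 \times 1 + 7 \times 2 + 8 \times 3 &= 43.
\end{aligned}$$ (6.2.1)
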

Note that along each axis, the output size is slightly smaller than the input size. Because the kernel
has width and height greater than one, we can only properly compute the cross-correlation for
locations where the kernel fits wholly within the image. The output size is therefore given by the
input size $n_h \times n_w$ minus the size of the convolution kernel $k_h \times k_w$ via


$$(n_h - k_h + 1) \times (n_w - k_w + 1).$$ (6.2.2)


This is the case since we need enough space to “shift” the convolution kernel across the image. 
Later we will see how to keep the size unchanged by padding the image with zeros around its 
boundary so that there is enough space to shift the kernel. Next, we implement this process in 
the corr2d function, which accepts an input tensor X and a kernel tensor K and returns an output 
tensor Y. 


from d2l import mxnet as d2l
from mxnet import autograd, np, npx
from mxnet.gluon import nn
npx.set_np()


def corr2d(X, K):  #@save
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Multiply the current window elementwise by the kernel and sum
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y


We can construct the input tensor X and the kernel tensor K from Fig. 6.2.1 to validate the output 
of the above implementation of the two-dimensional cross-correlation operation. 


X = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = np.array([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)


array([[19., 25.],
       [37., 43.]])


6.2.2 Convolutional Layers


A convolutional layer cross-correlates the input and kernel and adds a scalar bias to produce an 
output. The two parameters of a convolutional layer are the kernel and the scalar bias. When 
training models based on convolutional layers, we typically initialize the kernels randomly, just 
as we would with a fully-connected layer. 


We are now ready to implement a two-dimensional convolutional layer based on the corr2d
function defined above. In the __init__ constructor function, we declare weight and bias as the two
model parameters. The forward propagation function calls the corr2d function and adds the bias.


class Conv2D(nn.Block):
    def __init__(self, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.weight = self.params.get('weight', shape=kernel_size)
        self.bias = self.params.get('bias', shape=(1,))

    def forward(self, x):
        return corr2d(x, self.weight.data()) + self.bias.data()


In an $h \times w$ convolution or an $h \times w$ convolution kernel, the height and width of the convolution kernel
are $h$ and $w$, respectively. We also refer to a convolutional layer with an $h \times w$ convolution kernel
simply as an $h \times w$ convolutional layer.


6.2.3 Object Edge Detection in Images 


Let us take a moment to parse a simple application of a convolutional layer: detecting the edge of 
an object in an image by finding the location of the pixel change. First, we construct an “image” 
of 6 x 8 pixels. The middle four columns are black (0) and the rest are white (1). 


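X = np.ones((6, 8))
X[:, 2:6] = 0
X


array([[1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.]])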


Next, we construct a kernel K with a height of 1 and a width of 2. When we perform the
cross-correlation operation with the input, if the horizontally adjacent elements are the same, the output
is 0. Otherwise, the output is non-zero.


K = np.array([[1.0, -1.0]])


We are ready to perform the cross-correlation operation with arguments X (our input) and K (our 
kernel). As you can see, we detect 1 for the edge from white to black and -1 for the edge from black 
to white. All other outputs take value 0. 







Y = corr2d(X, K) 
Y 


array([[ 0., 1., 0., 0., 0., -1., 0.1, 
ths ea e A, O AN da 
PRO A O a e SE 
È Osa lao Ga Da Oa ie Ol 
L Org oa Ory Oo Org os Osl 
Eas ies Oon Our Oop Hl. 0I 


We can now apply the kernel to the transposed image. As expected, it vanishes. The kernel K only 
detects vertical edges. 


corr2d(X.T, K) 


array([[0., 0., 0., 0., 0.], 
Oop Ooo Oon Oop Ool; 
des Oog Ora Des Del, 
lO... On, 0%, Or 001, 
des Dos Des Des Gell, 
E Oon Os O 
[0., 0., Ooa 0., 0.], 
Oas Oos Oop Oos 0-1 





6.2.4 Learning a Kernel 


Designing an edge detector by finite differences [1, -1] is neat if we know this is precisely what 
we are looking for. However, as we look at larger kernels, and consider successive layers of con- 
volutions, it might be impossible to specify precisely what each filter should be doing manually. 


Now let us see whether we can learn the kernel that generated Y from X by looking at the input- 
output pairs only. We first construct a convolutional layer and initialize its kernel as a random 
tensor. Next, in each iteration, we will use the squared error to compare Y with the output of 
the convolutional layer. We can then calculate the gradient to update the kernel. For the sake of 
simplicity, in the following we use the built-in class for two-dimensional convolutional layers and 
ignore the bias. 


# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.Conv2D(1, kernel_size=(1, 2), use_bias=False)
conv2d.initialize()

# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape(1, 1, 6, 8)
Y = Y.reshape(1, 1, 6, 7)

for i in range(10):
    with autograd.record():
        Y_hat = conv2d(X)
        l = (Y_hat - Y) ** 2
    l.backward()
    # Update the kernel
    conv2d.weight.data()[:] -= 3e-2 * conv2d.weight.grad()
    if (i + 1) % 2 == 0:
        print(f'batch {i + 1}, loss {float(l.sum()):.3f}')


batch 2, loss 4.949 
batch 4, loss 0.831 
batch 6, loss 0.140 
batch 8, loss 0.024 
batch 10, loss 0.004 


Note that the error has dropped to a small value after 10 iterations. Now we will take a look at the 
kernel tensor we learned. 


conv2d.weight.data().reshape((1, 2)) 


array([[ 0.9895   , -0.9873705]])


Indeed, the learned kernel tensor is remarkably close to the kernel tensor K we defined earlier. 
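
It can be instructive to see the same experiment with the gradient written out by hand. The following is a sketch under our own assumptions, not the code used above: it uses NumPy instead of automatic differentiation, redefines `corr2d` so that it is self-contained, and draws the initial kernel from a small uniform range that we chose for illustration. For the summed squared error, the gradient with respect to the kernel is itself a cross-correlation of the input with the error, which is what the loop exploits.

```python
import numpy as np

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = np.ones((6, 8))
X[:, 2:6] = 0
Y = corr2d(X, np.array([[1.0, -1.0]]))    # target produced by the edge detector

rng = np.random.default_rng(0)
K_learned = rng.uniform(-0.07, 0.07, size=(1, 2))    # assumed small random init
lr = 3e-2                                            # same step size as above

for _ in range(10):
    err = corr2d(X, K_learned) - Y                   # shape (6, 7)
    # d/dK[0, b] of sum(err ** 2) equals 2 * sum_{i,j} err[i, j] * X[i, j + b],
    # i.e., twice the cross-correlation of X with err, which has shape (1, 2).
    grad = 2 * corr2d(X, err)
    K_learned -= lr * grad

print(K_learned)
```

After ten updates the two entries should again be close to 1 and -1, although the exact digits depend on the random initialization.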


6.2.5 Cross-Correlation and Convolution 


Recall our observation from Section 6.1 of the correspondence between the cross-correlation and 
convolution operations. Here let us continue to consider two-dimensional convolutional layers. 
What if such layers perform strict convolution operations as defined in (6.1.6) instead of cross- 
correlations? In order to obtain the output of the strict convolution operation, we only need to flip 
the two-dimensional kernel tensor both horizontally and vertically, and then perform the cross- 
correlation operation with the input tensor. 


It is noteworthy that since kernels are learned from data in deep learning, the outputs of
convolutional layers remain unaffected whether such layers perform strict convolution
operations or cross-correlation operations.


To illustrate this, suppose that a convolutional layer performs cross-correlation and learns the ker- 
nel in Fig. 6.2.1, which is denoted as the matrix K here. Assuming that other conditions remain 
unchanged, when this layer performs strict convolution instead, the learned kernel K’ will be the 
same as K after K’ is flipped both horizontally and vertically. That is to say, when the convolu- 
tional layer performs strict convolution for the input in Fig. 6.2.1 and K’, the same output in Fig. 
6.2.1 (cross-correlation of the input and K) will be obtained. 
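
The relationship can be checked numerically on the example from Fig. 6.2.1. The sketch below is a NumPy illustration with names of our choosing: `corr2d` repeats the cross-correlation defined earlier, and `conv2d_strict` is a hypothetical helper that evaluates (6.1.6) only at positions where the kernel fits inside the input.

```python
import numpy as np

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def conv2d_strict(X, K):
    """Strict convolution (6.1.6), restricted to positions where K fits inside X."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            for a in range(h):
                for b in range(w):
                    # The kernel index runs forward while the input index runs backward.
                    Y[i, j] += K[a, b] * X[i + h - 1 - a, j + w - 1 - b]
    return Y

X = np.arange(9.0).reshape(3, 3)
K = np.array([[0.0, 1.0], [2.0, 3.0]])
K_flipped = np.flip(K)    # flip horizontally and vertically

same = np.allclose(conv2d_strict(X, K_flipped), corr2d(X, K))
print(corr2d(X, K), same)
```

In other words, a layer that performs strict convolution can reproduce the cross-correlation output simply by learning the flipped kernel.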


In keeping with standard terminology in the deep learning literature, we will continue to refer to the
cross-correlation operation as a convolution even though, strictly speaking, it is slightly different.
Besides, we use the term element to refer to an entry (or component) of any tensor representing a 
layer representation or a convolution kernel. 







6.2.6 Feature Map and Receptive Field 


As described in Section 6.1.4, the convolutional layer output in Fig. 6.2.1 is sometimes called a fea- 
ture map, as it can be regarded as the learned representations (features) in the spatial dimensions 
(e.g., width and height) to the subsequent layer. In CNNs, for any element x of some layer, its re- 
ceptive field refers to all the elements (from all the previous layers) that may affect the calculation 
of x during the forward propagation. Note that the receptive field may be larger than the actual 
size of the input. 


Let us continue to use Fig. 6.2.1 to explain the receptive field. Given the 2 x 2 convolution kernel, 
the receptive field of the shaded output element (of value 19) is the four elements in the shaded 
portion of the input. Now let us denote the 2 x 2 output as Y and consider a deeper CNN with an 
additional 2 x 2 convolutional layer that takes Y as its input, outputting a single element z. In this 
case, the receptive field of z on Y includes all the four elements of Y, while the receptive field on 
the input includes all the nine input elements. Thus, when any element in a feature map needs a 
larger receptive field to detect input features over a broader area, we can build a deeper network. 
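
The growth of the receptive field with depth follows a simple rule in this setting. The helper below is a sketch with a hypothetical name, and it assumes stride 1 and no dilation (neither has been introduced at this point), so that each layer with a $k \times k$ kernel widens the field by $k - 1$ along each axis.

```python
def receptive_field(kernel_sizes):
    """Side length of the input region seen by one output element.

    Assumes stride 1 and no dilation, so each k x k layer widens the field by k - 1.
    """
    size = 1
    for k in kernel_sizes:
        size += k - 1
    return size

one_layer = receptive_field([2])        # one 2 x 2 layer
two_layers = receptive_field([2, 2])    # two stacked 2 x 2 layers
print(one_layer, two_layers)
```

The two values correspond to the four shaded input elements seen by one element of Y and to the nine input elements seen by $z$ in the example above.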


Summary 


• The core computation of a two-dimensional convolutional layer is a two-dimensional
  cross-correlation operation. In its simplest form, this performs a cross-correlation operation on
  the two-dimensional input data and the kernel, and then adds a bias.

• We can design a kernel to detect edges in images.

• We can learn the kernel's parameters from data.

• With kernels learned from data, the outputs of convolutional layers remain unaffected
  regardless of such layers’ performed operations (either strict convolution or
  cross-correlation).

• When any element in a feature map needs a larger receptive field to detect broader features
  on the input, a deeper network can be considered.


Exercises 


1. Construct an image X with diagonal edges. 
    1. What happens if you apply the kernel K in this section to it?
    2. What happens if you transpose X?
    3. What happens if you transpose K?


2. When you try to automatically find the gradient for the Conv2D class we created, what kind 
of error message do you see? 


3. How do you represent a cross-correlation operation as a matrix multiplication by changing 
the input and kernel tensors? 


4. Design some kernels manually.
    1. What is the form of a kernel for the second derivative?
    2. What is the kernel for an integral?
    3. What is the minimum size of a kernel to obtain a derivative of degree d?


Discussions 


6.3 Padding and Stride 


In the previous example of Fig. 6.2.1, our input had both a height and width of 3 and our
convolution kernel had both a height and width of 2, yielding an output representation with dimension
$2 \times 2$. As we generalized in Section 6.2, assuming that the input shape is $n_h \times n_w$ and the
convolution kernel shape is $k_h \times k_w$, then the output shape will be $(n_h - k_h + 1) \times (n_w - k_w + 1)$. Therefore,
the output shape of the convolutional layer is determined by the shape of the input and the shape
of the convolution kernel.


In several cases, we incorporate techniques, including padding and strided convolutions, that af- 
fect the size of the output. As motivation, note that since kernels generally have width and height 
greater than 1, after applying many successive convolutions, we tend to wind up with outputs that 
are considerably smaller than our input. If we start with a 240 x 240 pixel image, 10 layers of 5 x 5 
convolutions reduce the image to 200 x 200 pixels, slicing off 30% of the image and with it oblit- 
erating any interesting information on the boundaries of the original image. Padding is the most 
popular tool for handling this issue. 
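
The figures in this example follow directly from the output-size formula. Here is a short sketch in plain Python (the helper name `output_size` is ours) that applies the formula ten times along one axis and reports the fraction of pixels lost.

```python
def output_size(n, k):
    """Output length along one axis for input length n and kernel length k."""
    return n - k + 1

size = 240
for _ in range(10):
    size = output_size(size, 5)

# Fraction of the original 240 x 240 pixels that no longer appear in the output.
lost_fraction = 1 - size**2 / 240**2
print(size, f'{lost_fraction:.1%}')
```

Each $5 \times 5$ layer removes four rows and four columns, so the loss compounds steadily with depth.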


In other cases, we may want to reduce the dimensionality drastically, e.g., if we find the original 
input resolution to be unwieldy. Strided convolutions are a popular technique that can help in these 
instances. 


6.3.1 Padding 


As described above, one tricky issue when applying convolutional layers is that we tend to lose 
pixels on the perimeter of our image. Since we typically use small kernels, for any given convo- 
lution, we might only lose a few pixels, but this can add up as we apply many successive convolu- 
tional layers. One straightforward solution to this problem is to add extra pixels of filler around 
the boundary of our input image, thus increasing the effective size of the image. Typically, we 
set the values of the extra pixels to zero. In Fig. 6.3.1, we pad a 3 x 3 input, increasing its size to 
5 x 5. The corresponding output then increases to a 4 x 4 matrix. The shaded portions are the first 
output element as well as the input and kernel tensor elements used for the output computation: 
$0 \times 0 + 0 \times 1 + 0 \times 2 + 0 \times 3 = 0$.


Fig. 6.3.1: Two-dimensional cross-correlation with padding.


In general, if we add a total of $p_h$ rows of padding (roughly half on top and half on bottom) and
a total of $p_w$ columns of padding (roughly half on the left and half on the right), the output shape
will be


$$(n_h - k_h + p_h + 1) \times (n_w - k_w + p_w + 1).$$ (6.3.1)


This means that the height and width of the output will increase by $p_h$ and $p_w$, respectively.


In many cases, we will want to set $p_h = k_h - 1$ and $p_w = k_w - 1$ to give the input and output the
same height and width. This will make it easier to predict the output shape of each layer when
constructing the network. Assuming that $k_h$ is odd here, we will pad $p_h/2$ rows on both sides of
the height. If $k_h$ is even, one possibility is to pad $\lceil p_h/2 \rceil$ rows on the top of the input and $\lfloor p_h/2 \rfloor$
rows on the bottom. We will pad both sides of the width in the same way.


CNNs commonly use convolution kernels with odd height and width values, such as 1, 3, 5, or 7. 
Choosing odd kernel sizes has the benefit that we can preserve the spatial dimensionality while 
padding with the same number of rows on top and bottom, and the same number of columns on 
left and right. 
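
The bookkeeping for shape-preserving padding can be sketched in a few lines. Both helper names below are hypothetical, and the split for even kernel sizes (the larger half first) is just the one possibility mentioned above, not a rule imposed by any particular framework.

```python
def same_padding(k):
    """Split p = k - 1 rows (or columns) of padding into two sides.

    Returns (before, after) = (ceil(p / 2), floor(p / 2)).
    """
    p = k - 1
    return (p + 1) // 2, p // 2

def padded_output_size(n, k, p):
    """Output length along one axis with a total of p padded entries, as in (6.3.1)."""
    return n - k + p + 1

for k in (1, 2, 3, 4, 5, 7):
    before, after = same_padding(k)
    print(k, (before, after), padded_output_size(8, k, before + after))
```

For odd kernel sizes the two sides receive the same amount of padding, whereas for even sizes they differ by one, which is one reason odd sizes are convenient.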


Moreover, this practice of using odd kernels and padding to precisely preserve dimensionality
offers a clerical benefit. For any two-dimensional tensor X, when the kernel's size is odd and the
number of padding rows and columns is the same on all sides, producing an output with the same
height and width as the input, we know that the output Y[i, j] is calculated by cross-correlation
of the input and convolution kernel with the window centered on X[i, j].


In the following example, we create a two-dimensional convolutional layer with a kernel height and
width of 3 and apply 1 pixel of padding on all sides. Given an input with a height and width of 8, we
find that the height and width of the output are also 8.


from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()

# For convenience, we define a function to calculate the convolutional layer.
# This function initializes the convolutional layer weights and performs
# corresponding dimensionality elevations and reductions on the input and
# output
def comp_conv2d(conv2d, X):
    conv2d.initialize()
    # Here (1, 1) indicates that the batch size and the number of channels
    # are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Exclude the first two dimensions that do not interest us: examples and
    # channels
    return Y.reshape(Y.shape[2:])

# Note that here 1 row or column is padded on either side, so a total of 2
# rows or columns are added
conv2d = nn.Conv2D(1, kernel_size=3, padding=1)
X = np.random.uniform(size=(8, 8))
comp_conv2d(conv2d, X).shape


(8, 8)


When the height and width of the convolution kernel are different, we can make the output and
input have the same height and width by setting different padding numbers for height and width.


# Here, we use a convolution kernel with a height of 5 and a width of 3. The
# padding numbers on either side of the height and width are 2 and 1,
# respectively
conv2d = nn.Conv2D(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape


(8, 8)


6.3.2 Stride 


When computing the cross-correlation, we start with the convolution window at the top-left cor- 
ner of the input tensor, and then slide it over all locations both down and to the right. In previous 
examples, we default to sliding one element at a time. However, sometimes, either for computa- 
tional efficiency or because we wish to downsample, we move our window more than one element 
at a time, skipping the intermediate locations. 


We refer to the number of rows and columns traversed per slide as the stride. So far, we have used
strides of 1, both for height and width. Sometimes, we may want to use a larger stride. Fig. 6.3.2
shows a two-dimensional cross-correlation operation with a stride of 3 vertically and 2 horizontally.
The shaded portions are the output elements as well as the input and kernel tensor elements
used for the output computation: $0\times0+0\times1+1\times2+2\times3=8$, $0\times0+6\times1+0\times2+0\times3=6$.
We can see that when the second element of the first column is computed, the convolution window
slides down three rows. The convolution window slides two columns to the right when the
second element of the first row is computed. When the convolution window continues to slide
two columns to the right on the input, there is no output because the input elements cannot fill the
window (unless we add another column of padding).


[Figure: panels Input, Kernel, Output]

Fig. 6.3.2: Cross-correlation with strides of 3 and 2 for height and width, respectively. 
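The stepping of the window can be made explicit with a short plain-Python sketch. The function `corr2d_strided` is a hypothetical helper, and the padded 5 x 5 input and the 2 x 2 kernel are the values we assume Fig. 6.3.2 to use; they are consistent with the two products quoted in the text.

```python
def corr2d_strided(X, K, stride_h=1, stride_w=1):
    """Cross-correlation on nested lists, moving the window stride_h rows
    down and stride_w columns to the right per step."""
    k_h, k_w = len(K), len(K[0])
    out_h = (len(X) - k_h) // stride_h + 1
    out_w = (len(X[0]) - k_w) // stride_w + 1
    Y = []
    for i in range(out_h):
        top = i * stride_h              # first row covered by the window
        row = []
        for j in range(out_w):
            left = j * stride_w         # first column covered by the window
            row.append(sum(X[top + a][left + b] * K[a][b]
                           for a in range(k_h) for b in range(k_w)))
        Y.append(row)
    return Y


# 3 x 3 input with entries 0..8, already zero-padded by one element per side
X_padded = [[0, 0, 0, 0, 0],
            [0, 0, 1, 2, 0],
            [0, 3, 4, 5, 0],
            [0, 6, 7, 8, 0],
            [0, 0, 0, 0, 0]]
K = [[0, 1], [2, 3]]
Y = corr2d_strided(X_padded, K, stride_h=3, stride_w=2)
print(Y)  # [[0, 8], [6, 8]]
```

A third window position along the width would start at column 4 and need column 5, which does not exist, so only two outputs per row are produced.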


In general, when the stride for the height is $s_h$ and the stride for the width is $s_w$, the output shape
is


$\lfloor (n_h - k_h + p_h + s_h)/s_h \rfloor \times \lfloor (n_w - k_w + p_w + s_w)/s_w \rfloor$. (6.3.2)


If we set $p_h = k_h - 1$ and $p_w = k_w - 1$, then the output shape will be simplified to
$\lfloor (n_h + s_h - 1)/s_h \rfloor \times \lfloor (n_w + s_w - 1)/s_w \rfloor$. Going a step further, if the input height and width are divisible by
the strides on the height and width, then the output shape will be $(n_h/s_h) \times (n_w/s_w)$.
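Formula (6.3.2) is easy to evaluate directly. The sketch below uses hypothetical helper names and takes the total padding per axis, as in the formula; note that framework arguments such as `padding=1` usually mean padding per side, so the total is twice that value.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one axis for input size n, kernel size k,
    TOTAL padding p and stride s, following (6.3.2)."""
    return (n - k + p + s) // s


def conv_output_shape(n_hw, k_hw, p_hw=(0, 0), s_hw=(1, 1)):
    """Apply conv_output_size to height and width."""
    return tuple(conv_output_size(n, k, p, s)
                 for n, k, p, s in zip(n_hw, k_hw, p_hw, s_hw))


# 8 x 8 input, 3 x 3 kernel, 1 pixel of padding per side, stride 2
print(conv_output_shape((8, 8), (3, 3), (2, 2), (2, 2)))  # (4, 4)
# 8 x 8 input, 3 x 5 kernel, padding (0, 1) per side, strides (3, 4)
print(conv_output_shape((8, 8), (3, 5), (0, 2), (3, 4)))  # (2, 2)
# With p = k - 1 the result simplifies to floor((n + s - 1) / s)
print(conv_output_size(7, 3, p=2, s=2), (7 + 2 - 1) // 2)  # 4 4
```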


Below, we set the strides on both the height and width to 2, thus halving the input height and width.

conv2d = nn.Conv2D(1, kernel_size=3, padding=1, strides=2)
comp_conv2d(conv2d, X).shape


(4, 4)


Next, we will look at a slightly more complicated example. 


conv2d = nn.Conv2D(1, kernel_size=(3, 5), padding=(0, 1), strides=(3, 4)) 
comp_conv2d(conv2d, X).shape 


(2, 2) 


For the sake of brevity, when the padding numbers on both sides of the input height and width are
$p_h$ and $p_w$ respectively, we call the padding $(p_h, p_w)$. Specifically, when $p_h = p_w = p$, the padding is
$p$. When the strides on the height and width are $s_h$ and $s_w$, respectively, we call the stride $(s_h, s_w)$.
Specifically, when $s_h = s_w = s$, the stride is $s$. By default, the padding is 0 and the stride is 1.
In practice, we rarely use inhomogeneous strides or padding, i.e., we usually have $p_h = p_w$ and
$s_h = s_w$.


Summary 
• Padding can increase the height and width of the output. This is often used to give the output
the same height and width as the input.


• The stride can reduce the resolution of the output, for example reducing the height and width
of the output to only 1/n of the height and width of the input (n is an integer greater than 1).


• Padding and stride can be used to adjust the dimensionality of the data effectively.


Exercises 
1. For the last example in this section, use mathematics to calculate the output shape to see if 
it is consistent with the experimental result. 
2. Try other padding and stride combinations on the experiments in this section. 
3. For audio signals, what does a stride of 2 correspond to? 
4. What are the computational benefits of a stride larger than 1?


Discussions


6.4 Multiple Input and Multiple Output Channels


While we have described the multiple channels that comprise each image (e.g., color images have 
the standard RGB channels to indicate the amount of red, green and blue) and convolutional layers 
for multiple channels in Section 6.1.4, until now, we simplified all of our numerical examples by 
working with just a single input and a single output channel. This has allowed us to think of our 
inputs, convolution kernels, and outputs each as two-dimensional tensors. 


When we add channels into the mix, our inputs and hidden representations both become three- 
dimensional tensors. For example, each RGB input image has shape 3 x h x w. We refer to this axis, 
with a size of 3, as the channel dimension. In this section, we will take a deeper look at convolution 
kernels with multiple input and multiple output channels. 


6.4.1 Multiple Input Channels 


When the input data contain multiple channels, we need to construct a convolution kernel with
the same number of input channels as the input data, so that it can perform cross-correlation with
the input data. Assuming that the number of channels for the input data is $c_i$, the number of input
channels of the convolution kernel also needs to be $c_i$. If our convolution kernel's window shape
is $k_h \times k_w$, then when $c_i = 1$, we can think of our convolution kernel as just a two-dimensional
tensor of shape $k_h \times k_w$.


However, when $c_i > 1$, we need a kernel that contains a tensor of shape $k_h \times k_w$ for every input
channel. Concatenating these $c_i$ tensors together yields a convolution kernel of shape $c_i \times k_h \times k_w$.
Since the input and convolution kernel each have $c_i$ channels, we can perform a cross-correlation
operation on the two-dimensional tensor of the input and the two-dimensional tensor of the
convolution kernel for each channel, adding the $c_i$ results together (summing over the channels) to
yield a two-dimensional tensor. This is the result of a two-dimensional cross-correlation between
a multi-channel input and a multi-input-channel convolution kernel.


In Fig. 6.4.1, we demonstrate an example of a two-dimensional cross-correlation with two input
channels. The shaded portions are the first output element as well as the input and kernel tensor
elements used for the output computation:
$(1\times1+2\times2+4\times3+5\times4)+(0\times0+1\times1+3\times2+4\times3)=56$.


[Figure: panels Input, Kernel, Input, Kernel, Output]

Fig. 6.4.1: Cross-correlation computation with 2 input channels. 


To make sure we really understand what is going on here, we can implement cross-correlation
operations with multiple input channels ourselves. Notice that all we are doing is performing one
cross-correlation operation per channel and then adding up the results.

from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()


def corr2d_multi_in(X, K):
    # First, iterate through the 0th dimension (channel dimension) of `X` and
    # `K`. Then, add them together
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))


We can construct the input tensor X and the kernel tensor K corresponding to the values in Fig. 
6.4.1 to validate the output of the cross-correlation operation. 


X = np.array([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
              [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = np.array([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])


corr2d_multi_in(X, K)


array([[ 56.,  72.],
       [104., 120.]])


6.4.2 Multiple Output Channels 


Regardless of the number of input channels, so far we always ended up with one output channel. 
However, as we discussed in Section 6.1.4, it turns out to be essential to have multiple channels 
at each layer. In the most popular neural network architectures, we actually increase the channel 
dimension as we go higher up in the neural network, typically downsampling to trade off spatial 
resolution for greater channel depth. Intuitively, you could think of each channel as responding to
some different set of features. Reality is a bit more complicated than the most naive interpretations
of this intuition since representations are not learned independently but are rather optimized
to be jointly useful. So it may not be that a single channel learns an edge detector but rather that
some direction in channel space corresponds to detecting edges.


Denote by $c_i$ and $c_o$ the number of input and output channels, respectively, and let $k_h$ and $k_w$
be the height and width of the kernel. To get an output with multiple channels, we can create a
kernel tensor of shape $c_i \times k_h \times k_w$ for every output channel. We concatenate them on the output
channel dimension, so that the shape of the convolution kernel is $c_o \times c_i \times k_h \times k_w$. In
cross-correlation operations, the result on each output channel is calculated from the convolution kernel
corresponding to that output channel and takes input from all channels in the input tensor.


We implement a cross-correlation function to calculate the output of multiple channels as shown 
below. 


def corr2d_multi_in_out(X, K):
    # Iterate through the 0th dimension of `K`, and each time, perform
    # cross-correlation operations with input `X`. All of the results are
    # stacked together
    return np.stack([corr2d_multi_in(X, k) for k in K], 0)


We construct a convolution kernel with 3 output channels by concatenating the kernel tensor K
with K+1 (plus one for each element in K) and K+2.

K = np.stack((K, K + 1, K + 2), 0)
K.shape 


(3, 2, 2, 2) 


Below, we perform cross-correlation operations on the input tensor X with the kernel tensor K. 
Now the output contains 3 channels. The result of the first channel is consistent with the result of 
the previous input tensor X and the multi-input channel, single-output channel kernel. 


corr2d_multi_in_out(X, K) 





array([[[ 56.,  72.],
        [104., 120.]],

       [[ 76., 100.],
        [148., 172.]],

       [[ 96., 128.],
        [192., 224.]]])





6.4.3 1 x 1 Convolutional Layer 


At first, a 1 x 1 convolution, i.e., $k_h = k_w = 1$, does not seem to make much sense. After all, a
convolution correlates adjacent pixels. A 1 x 1 convolution obviously does not. Nonetheless, they
are popular operations that are sometimes included in the designs of complex deep networks. Let
us see in some detail what it actually does.


Because the minimum window is used, the 1 x 1 convolution loses the ability of larger convo- 
lutional layers to recognize patterns consisting of interactions among adjacent elements in the 
height and width dimensions. The only computation of the 1 x 1 convolution occurs on the chan- 
nel dimension. 


Fig. 6.4.2 shows the cross-correlation computation using the 1 x 1 convolution kernel with 3 input
channels and 2 output channels. Note that the inputs and outputs have the same height and width.
Each element in the output is derived from a linear combination of elements at the same position in
the input image. You could think of the 1 x 1 convolutional layer as constituting a fully-connected
layer applied at every single pixel location to transform the $c_i$ corresponding input values into $c_o$
output values. Because this is still a convolutional layer, the weights are tied across pixel location.
Thus the 1 x 1 convolutional layer requires $c_o \times c_i$ weights (plus the bias).


[Figure: panels Input, Kernel, Output]

Fig. 6.4.2: The cross-correlation computation uses the 1 x 1 convolution kernel with 3 input
channels and 2 output channels. The input and output have the same height and width.

Let us check whether this works in practice: we implement a 1 x 1 convolution using a
fully-connected layer. The only thing is that we need to make some adjustments to the data shape
before and after the matrix multiplication.


def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))
    K = K.reshape((c_o, c_i))
    Y = np.dot(K, X)  # Matrix multiplication in the fully-connected layer
    return Y.reshape((c_o, h, w))


When performing 1 x 1 convolution, the above function is equivalent to the previously imple- 
mented cross-correlation function corr2d_multi_in_out. Let us check this with some sample 
data. 


X = np.random.normal(0, 1, (3, 3, 3))
K = np.random.normal(0, 1, (2, 3, 1, 1))


Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
assert float(np.abs(Y1 - Y2).sum()) < 1e-6


Summary 


• Multiple channels can be used to extend the model parameters of the convolutional layer.


• The 1 x 1 convolutional layer is equivalent to the fully-connected layer, when applied on a
per pixel basis.


• The 1 x 1 convolutional layer is typically used to adjust the number of channels between
network layers and to control model complexity.


Exercises 
1. Assume that we have two convolution kernels of size $k_1$ and $k_2$, respectively (with no
nonlinearity in between).
    1. Prove that the result of the operation can be expressed by a single convolution.
    2. What is the dimensionality of the equivalent single convolution?
    3. Is the converse true?
2. Assume an input of shape $c_i \times h \times w$ and a convolution kernel of shape $c_o \times c_i \times k_h \times k_w$,
padding of $(p_h, p_w)$, and stride of $(s_h, s_w)$.
    1. What is the computational cost (multiplications and additions) for the forward propagation?
    2. What is the memory footprint?
    3. What is the memory footprint for the backward computation?
    4. What is the computational cost for the backpropagation?
3. By what factor does the number of calculations increase if we double the number of input
channels $c_i$ and the number of output channels $c_o$? What happens if we double the padding?
4. If the height and width of a convolution kernel is $k_h = k_w = 1$, what is the computational
complexity of the forward propagation?


5. Are the variables Y1 and Y2 in the last example of this section exactly the same? Why? 


6. How would you implement convolutions using matrix multiplication when the convolution 
window is not 1 x 1? 


Discussions


6.5 Pooling 


Often, as we process images, we want to gradually reduce the spatial resolution of our hidden 
representations, aggregating information so that the higher up we go in the network, the larger 
the receptive field (in the input) to which each hidden node is sensitive. 


Often our ultimate task asks some global question about the image, e.g., does it contain a cat? So
typically the units of our final layer should be sensitive to the entire input. By gradually aggregating
information, yielding coarser and coarser maps, we accomplish this goal of ultimately learning
a global representation, while keeping all of the advantages of convolutional layers at the
intermediate layers of processing.


Moreover, when detecting lower-level features, such as edges (as discussed in Section 6.2), we 
often want our representations to be somewhat invariant to translation. For instance, if we take 
the image X with a sharp delineation between black and white and shift the whole image by one 
pixel to the right, i.e., Z[i, j] = X[i, j - 1], then the output for the new image Z might be vastly
different. The edge will have shifted by one pixel. In reality, objects hardly ever occur exactly at 
the same place. In fact, even with a tripod and a stationary object, vibration of the camera due to 
the movement of the shutter might shift everything by a pixel or so (high-end cameras are loaded 
with special features to address this problem). 


This section introduces pooling layers, which serve the dual purposes of mitigating the sensitivity 
of convolutional layers to location and of spatially downsampling representations. 


6.5.1 Maximum Pooling and Average Pooling 


Like convolutional layers, pooling operators consist of a fixed-shape window that is slid over all 
regions in the input according to its stride, computing a single output for each location traversed 
by the fixed-shape window (sometimes known as the pooling window). However, unlike the cross- 
correlation computation of the inputs and kernels in the convolutional layer, the pooling layer 
contains no parameters (there is no kernel). Instead, pooling operators are deterministic, typically 
calculating either the maximum or the average value of the elements in the pooling window. These
operations are called maximum pooling (max pooling for short) and average pooling, respectively. 


In both cases, as with the cross-correlation operator, we can think of the pooling window as starting
from the top left of the input tensor and sliding across the input tensor from left to right and
top to bottom. At each location that the pooling window hits, it computes the maximum or average
value of the input subtensor in the window, depending on whether max or average pooling is
employed.


[Figure: 2 x 2 maximum pooling; surviving panel label: Output]

Fig. 6.5.1: Maximum pooling with a pooling window shape of 2 x 2. The shaded portions are
the first output element as well as the input tensor elements used for the output computation:
max(0, 1, 3, 4) = 4.


The output tensor in Fig. 6.5.1 has a height of 2 and a width of 2. The four elements are derived 
from the maximum value in each pooling window: 


max(0, 1, 3, 4) = 4,
max(1, 2, 4, 5) = 5,
max(3, 4, 6, 7) = 7,
max(4, 5, 7, 8) = 8. (6.5.1)


A pooling layer with a pooling window shape of p x q is called a p x q pooling layer. The pooling 
operation is called p x q pooling. 


Let us return to the object edge detection example mentioned at the beginning of this section.
Now we will use the output of the convolutional layer as the input for 2 x 2 maximum pooling. Set
the convolutional layer input as X and the pooling layer output as Y. Whether or not the values of
X[i, j] and X[i, j + 1] are different, or X[i, j + 1] and X[i, j + 2] are different, the pooling
layer always outputs Y[i, j] = 1. That is to say, using the 2 x 2 maximum pooling layer, we can
still detect if the pattern recognized by the convolutional layer moves no more than one element
in height or width.
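A toy example makes this tolerance concrete. The sketch below is plain Python with a hypothetical helper `max_pool2d_plain` (stride 1, no padding); the two small maps are made-up detector responses, not outputs of the edge detector from Section 6.2.

```python
def max_pool2d_plain(X, p_h, p_w):
    """Maximum pooling with stride 1 on a nested list."""
    out_h, out_w = len(X) - p_h + 1, len(X[0]) - p_w + 1
    return [[max(X[i + a][j + b] for a in range(p_h) for b in range(p_w))
             for j in range(out_w)]
            for i in range(out_h)]


# Hypothetical detector response: the pattern fires in column 1 ...
detected = [[0, 1, 0, 0] for _ in range(3)]
# ... and the same response after the pattern moved one column to the right
shifted = [[0, 0, 1, 0] for _ in range(3)]

pooled_detected = max_pool2d_plain(detected, 2, 2)
pooled_shifted = max_pool2d_plain(shifted, 2, 2)
print(pooled_detected)  # [[1, 1, 0], [1, 1, 0]]
print(pooled_shifted)   # [[0, 1, 1], [0, 1, 1]]
```

The unpooled maps disagree at columns 1 and 2, yet the pooled output in column 1 equals 1 in both cases, so a downstream unit reading that position still sees the pattern after a one-element shift.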


In the code below, we implement the forward propagation of the pooling layer in the pool2d func- 
tion. This function is similar to the corr2d function in Section 6.2. However, here we have no 
kernel, computing the output as either the maximum or the average of each region in the input. 


from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()


def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = np.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y


We can construct the input tensor X in Fig. 6.5.1 to validate the output of the two-dimensional
maximum pooling layer.


X = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]) 
pool2d(X, (2, 2)) 


array([[4., 5.],
       [7., 8.]])


Also, we experiment with the average pooling layer. 


pool2d(X, (2, 2), 'avg') 


array([[2., 3.],
       [5., 6.]])


6.5.2 Padding and Stride 


As with convolutional layers, pooling layers can also change the output shape. And as before, we 
can alter the operation to achieve a desired output shape by padding the input and adjusting the 
stride. We can demonstrate the use of padding and strides in pooling layers via the built-in two- 
dimensional maximum pooling layer from the deep learning framework. We first construct an 
input tensor X whose shape has four dimensions, where the number of examples and number of 
channels are both 1. 


X = np.arange(16, dtype=np.float32).reshape((1, 1, 4, 4)) 
X 


array([[[[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.],
         [12., 13., 14., 15.]]]])


By default, the stride and the pooling window in the instance from the framework's built-in class 
have the same shape. Below, we use a pooling window of shape (3, 3), so we get a stride shape 
of (3, 3) by default. 


pool2d = nn.MaxPool2D(3)
# Because there are no model parameters in the pooling layer, we do not need
# to call the parameter initialization function
pool2d(X)


array([[[[10.]]]])
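The single output element is what the output-shape rule from Section 6.3 predicts when the stride equals the window size. The plain-Python sketch below evaluates that rule for a few settings; `pool_output_shape` is a hypothetical helper, padding is counted per side, and we assume the layer rounds down when the window does not fit evenly.

```python
def pool_output_shape(n_hw, window, padding=(0, 0), strides=None):
    """Spatial output shape of a pooling layer.

    padding is given per side; strides default to the window shape.
    Assumes rounding down when the window does not fit evenly.
    """
    if strides is None:
        strides = window
    return tuple((n + 2 * p - k) // s + 1
                 for n, k, p, s in zip(n_hw, window, padding, strides))


# 4 x 4 input, 3 x 3 window, default stride (3, 3)
print(pool_output_shape((4, 4), (3, 3)))                  # (1, 1)
# 4 x 4 input, 3 x 3 window, padding 1 per side, stride 2
print(pool_output_shape((4, 4), (3, 3), (1, 1), (2, 2)))  # (2, 2)
# 4 x 4 input, 2 x 3 window, padding (1, 2) per side, strides (2, 3)
print(pool_output_shape((4, 4), (2, 3), (1, 2), (2, 3)))  # (3, 2)
```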


The stride and padding can be manually specified.


pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)


Of course, we can specify an arbitrary rectangular pooling window and specify the padding and
stride for height and width, respectively.


pool2d = nn.MaxPool2D((2, 3), padding=(1, 2), strides=(2, 3)) 
pool2d(X) 


array([[[[ 0.,  3.],
         [ 8., 11.],
         [12., 15.]]]])


6.5.3 Multiple Channels


When processing multi-channel input data, the pooling layer pools each input channel separately, 
rather than summing the inputs up over channels as in a convolutional layer. This means that the 
number of output channels for the pooling layer is the same as the number of input channels. 
Below, we will concatenate tensors X and X + 1 on the channel dimension to construct an input
with 2 channels.


X = np.concatenate((X, X + 1), 1)
X


As we can see, the number of output channels is still 2 after pooling.

pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)


Summary


• Taking the input elements in the pooling window, the maximum pooling operation assigns
the maximum value as the output and the average pooling operation assigns the average
value as the output.


• One of the major benefits of a pooling layer is to alleviate the excessive sensitivity of the
convolutional layer to location.


• We can specify the padding and stride for the pooling layer.


• Maximum pooling, combined with a stride larger than 1, can be used to reduce the spatial
dimensions (e.g., width and height).


• The pooling layer's number of output channels is the same as the number of input channels.


Exercises 


1. Can you implement average pooling as a special case of a convolution layer? If so, do it. 
2. Can you implement max pooling as a special case of a convolution layer? If so, do it. 


3. What is the computational cost of the pooling layer? Assume that the input to the pooling
layer is of size $c \times h \times w$, the pooling window has a shape of $p_h \times p_w$ with a padding of $(p_h, p_w)$
and a stride of $(s_h, s_w)$.


4. Why do you expect maximum pooling and average pooling to work differently? 
5. Do we need a separate minimum pooling layer? Can you replace it with another operation? 


6. Is there another operation between average and maximum pooling that you could consider 
(hint: recall the softmax)? Why might it not be so popular? 


Discussions [82]


6.6 Convolutional Neural Networks (LeNet) 


We now have all the ingredients required to assemble a fully-functional CNN. In our earlier en- 
counter with image data, we applied a softmax regression model (Section 3.6) and an MLP model 
(Section 4.2) to pictures of clothing in the Fashion-MNIST dataset. To make such data amenable 
to softmax regression and MLPs, we first flattened each image from a 28 x 28 matrix into a fixed- 
length 784-dimensional vector, and thereafter processed them with fully-connected layers. Now 
that we have a handle on convolutional layers, we can retain the spatial structure in our images. As 
an additional benefit of replacing fully-connected layers with convolutional layers, we will enjoy 
more parsimonious models that require far fewer parameters. 


In this section, we will introduce LeNet, among the first published CNNs to capture wide attention 
for its performance on computer vision tasks. The model was introduced by (and named for) Yann 
LeCun, then a researcher at AT&T Bell Labs, for the purpose of recognizing handwritten digits 
in images (LeCun et al., 1998). This work represented the culmination of a decade of research 
developing the technology. In 1989, LeCun published the first study to successfully train CNNs via 
backpropagation. 





[82] https://discuss.d2l.ai/t/71

At the time, LeNet achieved outstanding results matching the performance of support vector
machines, then a dominant approach in supervised learning. LeNet was eventually adapted to
recognize digits for processing deposits in ATMs. To this day, some ATMs still run the code
that Yann and his colleague Leon Bottou wrote in the 1990s!


6.6.1 LeNet 


At a high level, LeNet (LeNet-5) consists of two parts: (i) a convolutional encoder consisting of
two convolutional layers; and (ii) a dense block consisting of three fully-connected layers. The
architecture is summarized in Fig. 6.6.1.

Fig. 6.6.1: Data flow in LeNet. The input is a handwritten digit, the output a probability over 10
possible outcomes.


The basic units in each convolutional block are a convolutional layer, a sigmoid activation func- 
tion, and a subsequent average pooling operation. Note that while ReLUs and max-pooling work 
better, these discoveries had not yet been made in the 1990s. Each convolutional layer uses a 5 x 5 
kernel and a sigmoid activation function. These layers map spatially arranged inputs to a number 
of two-dimensional feature maps, typically increasing the number of channels. The first convolu- 
tional layer has 6 output channels, while the second has 16. Each 2 x 2 pooling operation (stride 2) 
reduces dimensionality by a factor of 4 via spatial downsampling. The convolutional block emits
an output with shape given by (batch size, number of channels, height, width).


In order to pass output from the convolutional block to the dense block, we must flatten each 
example in the minibatch. In other words, we take this four-dimensional input and transform 
it into the two-dimensional input expected by fully-connected layers: as a reminder, the
two-dimensional representation that we desire uses the first dimension to index examples in the
minibatch and the second to give the flat vector representation of each example. LeNet’s dense
block has three fully-connected layers, with 120, 84, and 10 outputs, respectively. Because we 
are still performing classification, the 10-dimensional output layer corresponds to the number of 
possible output classes. 


While getting to the point where you truly understand what is going on inside LeNet may have 
taken a bit of work, hopefully the following code snippet will convince you that implementing such 
models with modern deep learning frameworks is remarkably simple. We need only to instantiate 
a Sequential block and chain together the appropriate layers.

from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()


net = nn.Sequential()
net.add(nn.Conv2D(channels=6, kernel_size=5, padding=2, activation='sigmoid'),
        nn.AvgPool2D(pool_size=2, strides=2),
        nn.Conv2D(channels=16, kernel_size=5, activation='sigmoid'),
        nn.AvgPool2D(pool_size=2, strides=2),
        # `Dense` will transform an input of the shape (batch size, number of
        # channels, height, width) into an input of the shape (batch size,
        # number of channels * height * width) automatically by default
        nn.Dense(120, activation='sigmoid'),
        nn.Dense(84, activation='sigmoid'),
        nn.Dense(10))


We took a small liberty with the original model, removing the Gaussian activation in the final 
layer. Other than that, this network matches the original LeNet-5 architecture. 


By passing a single-channel (black and white) 28 x 28 image through the network and printing the 
output shape at each layer, we can inspect the model to make sure that its operations line up with 
what we expect from Fig. 6.6.2. 


[Figure: layer stack; surviving labels, top to bottom: 5 x 5 Conv (16); 5 x 5 Conv (6), pad 2; Image (28 x 28)]

Fig. 6.6.2: Compressed notation for LeNet-5.





X = np.random.uniform(size=(1, 1, 28, 28))
net.initialize()
for layer in net:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)


conv0 output shape:     (1, 6, 28, 28)
pool0 output shape:     (1, 6, 14, 14)
conv1 output shape:     (1, 16, 10, 10)
pool1 output shape:     (1, 16, 5, 5)
dense0 output shape:    (1, 120)
dense1 output shape:    (1, 84)
dense2 output shape:    (1, 10)

Note that the height and width of the representation never increase from one layer to the next
throughout the convolutional block. The first convolutional layer uses 2 pixels of
padding to compensate for the reduction in height and width that would otherwise result from
using a 5 x 5 kernel. In contrast, the second convolutional layer forgoes padding, and thus the height
and width are both reduced by 4 pixels. As we go up the stack of layers, the number of channels
increases layer-over-layer from 1 in the input to 6 after the first convolutional layer and 16 after
the second convolutional layer. However, each pooling layer halves the height and width. Finally,
each fully-connected layer reduces dimensionality, emitting at the end an output whose dimension
matches the number of classes.
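The shapes in the convolutional block can be traced by hand with the output-size rule from Section 6.3. The following plain-Python sketch does so without the framework; the layer table is our own transcription of the model defined above, and `out_size` is a hypothetical helper that takes padding per side.

```python
def out_size(n, k, pad=0, stride=1):
    """Output length along one axis; pad is the padding per side."""
    return (n + 2 * pad - k) // stride + 1


# (name, output channels or None for pooling, kernel, padding, stride)
layers = [("conv0", 6, 5, 2, 1),
          ("pool0", None, 2, 0, 2),
          ("conv1", 16, 5, 0, 1),
          ("pool1", None, 2, 0, 2)]

channels, size = 1, 28  # single-channel 28 x 28 input
shapes = {}
for name, out_channels, k, pad, stride in layers:
    if out_channels is not None:  # pooling keeps the channel count
        channels = out_channels
    size = out_size(size, k, pad, stride)
    shapes[name] = (1, channels, size, size)
    print(name, shapes[name])

flat = channels * size * size     # features seen by the first dense layer
print("flattened features:", flat)  # flattened features: 400
```

The first dense layer therefore receives 16 x 5 x 5 = 400 features per example.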


6.6.2 Training 


Now that we have implemented the model, let us run an experiment to see how LeNet fares on 
Fashion-MNIST. 


batch_size = 256 
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)


While CNNs have fewer parameters, they can still be more expensive to compute than similarly 
deep MLPs because each parameter participates in many more multiplications. If you have access 
to a GPU, this might be a good time to put it into action to speed up training. 
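A rough count for the LeNet defined above illustrates the point. The sketch below is plain Python with hypothetical helpers; it counts only multiplications for a single example and ignores pooling, activations, and bias additions, so the numbers are an estimate rather than a profile.

```python
def conv_cost(c_in, c_out, k, out_size):
    """(parameters, multiplications) of a k x k convolutional layer whose
    output has spatial size out_size x out_size."""
    params = c_out * (c_in * k * k + 1)  # weights plus one bias per channel
    mults = c_out * out_size * out_size * c_in * k * k
    return params, mults


def dense_cost(n_in, n_out):
    """(parameters, multiplications) of a fully-connected layer."""
    return n_out * (n_in + 1), n_in * n_out


costs = {"conv0": conv_cost(1, 6, 5, 28),
         "conv1": conv_cost(6, 16, 5, 10),
         "dense0": dense_cost(400, 120),
         "dense1": dense_cost(120, 84),
         "dense2": dense_cost(84, 10)}

for name, (params, mults) in costs.items():
    print(f"{name}: {params} parameters, {mults} multiplications")

conv_params = sum(costs[n][0] for n in ("conv0", "conv1"))
conv_mults = sum(costs[n][1] for n in ("conv0", "conv1"))
total_params = sum(p for p, _ in costs.values())
total_mults = sum(m for _, m in costs.values())
print(conv_params, "of", total_params, "parameters are convolutional")
print(conv_mults, "of", total_mults, "multiplications are convolutional")
```

Under these assumptions the two convolutional layers hold 2572 of the 61706 parameters but account for 357600 of the 416520 multiplications, because every kernel weight is reused at each output position.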


For evaluation, we need to make a slight modification to the evaluate_accuracy function that we 
described in Section 3.6. Since the full dataset is in the main memory, we need to copy it to the 
GPU memory before the model uses GPU to compute with the dataset. 


def evaluate_accuracy_gpu(net, data_iter, device=None):  #@save
    """Compute the accuracy for a model on a dataset using a GPU."""
    if not device:  # Query the first device where the first parameter is on
        device = list(net.collect_params().values())[0].list_ctx()[0]
    # No. of correct predictions, no. of predictions
    metric = d2l.Accumulator(2)
    for X, y in data_iter:
        X, y = X.as_in_ctx(device), y.as_in_ctx(device)
        metric.add(d2l.accuracy(net(X), y), y.size)
    return metric[0] / metric[1]


We also need to update our training function to deal with GPUs. Unlike the train_epoch_ch3 de- 
fined in Section 3.6, we now need to move each minibatch of data to our designated device (hope- 
fully, the GPU) prior to making the forward and backward propagations. 


The training function train_ch6 is also similar to train_ch3 defined in Section 3.6. Since we will
be implementing networks with many layers going forward, we will rely primarily on high-level
APIs. The following training function assumes a model created from high-level APIs as input and
is optimized accordingly. We initialize the model parameters on the device indicated by the device
argument, using Xavier initialization as introduced in Section 4.8.2. Just as with MLPs, our loss
function is cross-entropy, and we minimize it via minibatch stochastic gradient descent. Since
each epoch takes tens of seconds to run, we visualize the training loss more frequently.


#@save
def train_ch6(net, train_iter, test_iter, num_epochs, lr,
              device=d2l.try_gpu()):
    """Train a model with a GPU (defined in Chapter 6)."""
    net.initialize(force_reinit=True, ctx=device, init=init.Xavier())
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(net.collect_params(),
                            'sgd', {'learning_rate': lr})
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        # Sum of training loss, sum of training accuracy, no. of examples
        metric = d2l.Accumulator(3)
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            # Here is the major difference from `d2l.train_epoch_ch3`
            X, y = X.as_in_ctx(device), y.as_in_ctx(device)
            with autograd.record():
                y_hat = net(X)
                l = loss(y_hat, y)
            l.backward()
            trainer.step(X.shape[0])
            metric.add(l.sum(), d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')


Now let us train and evaluate the LeNet-5 model. 


lr, num_epochs = 0.9, 10 
train_ch6(net, train_iter, test_iter, num_epochs, lr)


loss 0.473, train acc 0.822, test acc 0.802
47612.3 examples/sec on gpu(0)

[Figure: train loss, train acc, and test acc plotted against epoch.]





Summary 


• A CNN is a network that employs convolutional layers.
• In a CNN, we interleave convolutions, nonlinearities, and (often) pooling operations.
• In a CNN, convolutional layers are typically arranged so that they gradually decrease the
  spatial resolution of the representations, while increasing the number of channels.
• In traditional CNNs, the representations encoded by the convolutional blocks are processed
  by one or more fully-connected layers prior to emitting output.
• LeNet was arguably the first successful deployment of such a network.


Exercises 


1. Replace the average pooling with max pooling. What happens?
2. Try to construct a more complex network based on LeNet to improve its accuracy.
   1. Adjust the convolution window size.
   2. Adjust the number of output channels.
   3. Adjust the activation function (e.g., ReLU).
   4. Adjust the number of convolution layers.
   5. Adjust the number of fully connected layers.
   6. Adjust the learning rates and other training details (e.g., initialization and number of
      epochs).
3. Try out the improved network on the original MNIST dataset.
4. Display the activations of the first and second layer of LeNet for different inputs (e.g.,
   sweaters and coats).


Discussions (https://discuss.d2l.ai/t/73)


7 Modern Convolutional Neural Networks


Now that we understand the basics of wiring together CNNs, we will take you through a tour of 
modern CNN architectures. In this chapter, each section corresponds to a significant CNN archi- 
tecture that was at some point (or currently) the base model upon which many research projects 
and deployed systems were built. Each of these networks was briefly a dominant architecture and 
many were winners or runners-up in the ImageNet competition, which has served as a barometer 
of progress on supervised learning in computer vision since 2010. 


These models include AlexNet, the first large-scale network deployed to beat conventional com- 
puter vision methods on a large-scale vision challenge; the VGG network, which makes use of a 
number of repeating blocks of elements; the network in network (NiN) which convolves whole 
neural networks patch-wise over inputs; GoogLeNet, which uses networks with parallel concate- 
nations; residual networks (ResNet), which remain the most popular off-the-shelf architecture in 
computer vision; and densely connected networks (DenseNet), which are expensive to compute 
but have set some recent benchmarks. 


While the idea of deep neural networks is quite simple (stack together a bunch of layers), perfor- 
mance can vary wildly across architectures and hyperparameter choices. The neural networks 
described in this chapter are the product of intuition, a few mathematical insights, and a whole 
lot of trial and error. We present these models in chronological order, partly to convey a sense 
of the history so that you can form your own intuitions about where the field is heading and per- 
haps develop your own architectures. For instance, batch normalization and residual connections 
described in this chapter have offered two popular ideas for training and designing deep models. 


7.1 Deep Convolutional Neural Networks (AlexNet) 


Although CNNs were well known in the computer vision and machine learning communities fol- 
lowing the introduction of LeNet, they did not immediately dominate the field. Although LeNet 
achieved good results on early small datasets, the performance and feasibility of training CNNs on 
larger, more realistic datasets had yet to be established. In fact, for much of the intervening time 
between the early 1990s and the watershed results of 2012, neural networks were often surpassed 
by other machine learning methods, such as support vector machines. 


For computer vision, this comparison is perhaps not fair. That is, although the inputs to
convolutional networks consist of raw or lightly-processed (e.g., by centering) pixel values,
practitioners would never feed raw pixels into traditional models. Instead, typical computer
vision pipelines relied on manually engineered feature extraction. Rather than being learned, the
features were crafted. Most of the progress came from having more clever ideas for features, and
the learning algorithm was often relegated to an afterthought.







Although some neural network accelerators were available in the 1990s, they were not yet
sufficiently powerful to make deep multichannel, multilayer CNNs with a large number of
parameters practical. Moreover, datasets were still relatively small. Added to these obstacles,
key tricks for training neural networks, including parameter initialization heuristics, clever
variants of stochastic gradient descent, non-squashing activation functions, and effective
regularization techniques, were still missing.


Thus, rather than training end-to-end (pixel to classification) systems, classical pipelines looked 
more like this: 


1. Obtain an interesting dataset. In early days, these datasets required expensive sensors (at 
the time, 1 megapixel images were state-of-the-art). 


2. Preprocess the dataset with hand-crafted features based on some knowledge of optics, geom- 
etry, other analytic tools, and occasionally on the serendipitous discoveries of lucky gradu- 
ate students. 


3. Feed the data through a standard set of feature extractors such as the SIFT (scale-invariant 
feature transform) (Lowe, 2004), the SURF (speeded up robust features) (Bay et al., 2006), or 
any number of other hand-tuned pipelines. 


4. Dump the resulting representations into your favorite classifier, likely a linear model or ker- 
nel method, to train a classifier. 
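The four steps above can be sketched in a few lines of NumPy. This is only an illustration under strong assumptions of our own: the "dataset" consists of synthetic stripe images, the hand-crafted feature is a crude histogram of gradient orientations (loosely in the spirit of HOG, and not SIFT or SURF), and the classifier is a nearest-centroid rule. All function names are hypothetical.

```python
import numpy as np

def make_stripes(vertical, size=16, rng=None):
    """Step 1: a synthetic image of noisy vertical or horizontal stripes."""
    if rng is None:
        rng = np.random.default_rng()
    pattern = np.sin(np.arange(size) * np.pi / 2)
    img = np.tile(pattern, (size, 1))  # intensity varies along the columns
    if not vertical:
        img = img.T
    return img + 0.1 * rng.standard_normal((size, size))

def orientation_histogram(img, bins=8):
    """Steps 2-3: magnitude-weighted histogram of gradient angles."""
    gy, gx = np.gradient(img)  # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % np.pi  # fold directions into [0, pi]
    hist, _ = np.histogram(angle, bins=bins, range=(0, np.pi),
                           weights=magnitude)
    return hist / (hist.sum() + 1e-8)

rng = np.random.default_rng(0)
labels = np.array([0, 1] * 20)  # 0: horizontal, 1: vertical
features = np.stack([orientation_histogram(make_stripes(bool(y), rng=rng))
                     for y in labels])

# Step 4: fit a simple classifier on the fixed features (one centroid per class)
train_X, train_y = features[:30], labels[:30]
test_X, test_y = features[30:], labels[30:]
centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in (0, 1)])

# Predict the class whose centroid is closest in feature space
dists = np.linalg.norm(test_X[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == test_y).mean()
print(f'test accuracy: {accuracy:.2f}')
```

Note that nothing about the feature extractor is learned here: only the two centroids depend on the training data, which is the division of labor that the classical pipeline implies.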


If you spoke to machine learning researchers, they believed that machine learning was both im- 
portant and beautiful. Elegant theories proved the properties of various classifiers. The field of 
machine learning was thriving, rigorous, and eminently useful. However, if you spoke to a com- 
puter vision researcher, you would hear a very different story. The dirty truth of image recogni- 
tion, they would tell you, is that features, not learning algorithms, drove progress. Computer vi- 
sion researchers justifiably believed that a slightly bigger or cleaner dataset or a slightly improved 
feature-extraction pipeline mattered far more to the final accuracy than any learning algorithm. 


7.1.1 Learning Representations 


Another way to cast the state of affairs is that the most important part of the pipeline was the
representation. And up until 2012 the representation was calculated mechanically. In fact,
engineering a new set of feature functions, improving results, and writing up the method was a
prominent genre of paper. SIFT (Lowe, 2004), SURF (Bay et al., 2006), HOG (histograms of oriented
gradients) (Dalal & Triggs, 2005), bags of visual words
(https://en.wikipedia.org/wiki/Bag-of-words_model_in_computer_vision), and similar feature
extractors ruled the roost.


Another group of researchers, including Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng,
Shun-ichi Amari, and Juergen Schmidhuber, had different plans. They believed that features
themselves ought to be learned. Moreover, they believed that to be reasonably complex, the
features ought to be hierarchically composed with multiple jointly learned layers, each with
learnable parameters. In the case of an image, the lowest layers might come to detect edges,
colors, and textures. Indeed, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton proposed a new
variant of a CNN, AlexNet, that achieved excellent performance in the 2012 ImageNet challenge.
AlexNet was named after Alex Krizhevsky, the first author of the breakthrough ImageNet
classification paper (Krizhevsky et al., 2012).


Interestingly, in the lowest layers of the network, the model learned feature extractors that
resembled some traditional filters. Fig. 7.1.1 is reproduced from the AlexNet paper (Krizhevsky
et al., 2012) and describes lower-level image descriptors.


Fig. 7.1.1: Image filters learned by the first layer of AlexNet.


Higher layers in the network might build upon these representations to represent larger struc- 
tures, like eyes, noses, blades of grass, and so on. Even higher layers might represent whole ob- 
jects like people, airplanes, dogs, or frisbees. Ultimately, the final hidden state learns a compact 
representation of the image that summarizes its contents such that data belonging to different 
categories can be easily separated. 


While the ultimate breakthrough for many-layered CNNs came in 2012, a core group of researchers 
had dedicated themselves to this idea, attempting to learn hierarchical representations of visual 
data for many years. The ultimate breakthrough in 2012 can be attributed to two key factors. 


Missing Ingredient: Data 


Deep models with many layers require large amounts of data in order to enter the regime where 
they significantly outperform traditional methods based on convex optimizations (e.g., linear and 
kernel methods). However, given the limited storage capacity of computers, the relative expense 
of sensors, and the comparatively tighter research budgets in the 1990s, most research relied on 
tiny datasets. Numerous papers addressed the UCI collection of datasets, many of which contained 
only hundreds or (a few) thousands of images captured in unnatural settings with low resolution. 


In 2009, the ImageNet dataset was released, challenging researchers to learn models from 1 million
examples, 1000 each from 1000 distinct categories of objects. The researchers who introduced
this dataset, led by Fei-Fei Li, leveraged Google Image Search to prefilter large candidate sets for
each category and employed the Amazon Mechanical Turk crowdsourcing pipeline to confirm for
each image whether it belonged to the associated category. This scale was unprecedented. The
associated competition, dubbed the ImageNet Challenge, pushed computer vision and machine
learning research forward, challenging researchers to identify which models performed best at a
greater scale than academics had previously considered.


Missing Ingredient: Hardware


Deep learning models are voracious consumers of compute cycles. Training can take hundreds 
of epochs, and each iteration requires passing data through many layers of computationally- 
expensive linear algebra operations. This is one of the main reasons why in the 1990s and early 
2000s, simple algorithms based on the more-efficiently optimized convex objectives were pre- 
ferred. 


Graphics processing units (GPUs) proved to be a game changer in making deep learning feasible.
These chips had long been developed for accelerating graphics processing to benefit computer 
games. In particular, they were optimized for high throughput 4 x 4 matrix-vector products, which 
are needed for many computer graphics tasks. Fortunately, this math is strikingly similar to that 
required to calculate convolutional layers. Around that time, NVIDIA and ATI had begun optimiz- 
ing GPUs for general computing operations, going as far as to market them as general-purpose GPUs 
(GPGPU). 
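To see why hardware built for dense linear algebra suits convolutional layers, consider the following NumPy sketch. It rewrites a single-channel 2D cross-correlation as one matrix-vector product by unrolling the input patches into rows, a trick often called im2col. This is our own illustration with hypothetical function names, and it makes no claim about how any particular GPU library implements convolutions.

```python
import numpy as np

def corr2d_loops(X, K):
    """Reference 2D cross-correlation with explicit loops."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def corr2d_matmul(X, K):
    """The same operation as one matrix-vector product over unrolled patches."""
    h, w = K.shape
    out_h, out_w = X.shape[0] - h + 1, X.shape[1] - w + 1
    # Each row holds one flattened h x w input patch
    patches = np.stack([X[i:i + h, j:j + w].ravel()
                        for i in range(out_h) for j in range(out_w)])
    return (patches @ K.ravel()).reshape(out_h, out_w)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))
K = rng.standard_normal((3, 3))
same = np.allclose(corr2d_loops(X, K), corr2d_matmul(X, K))
print('loop and matmul results agree:', same)
```

Once the work is phrased as a product between a patch matrix and a weight vector, every output element can be computed independently, which is the kind of workload that many small parallel processing elements handle well.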


To provide some intuition, consider the cores of a modern microprocessor (CPU). Each of the 
cores is fairly powerful running at a high clock frequency and sporting large caches (up to several 
megabytes of L3). Each core is well-suited to executing a wide range of instructions, with branch 
predictors, a deep pipeline, and other bells and whistles that enable it to run a large variety of 
programs. This apparent strength, however, is also its Achilles heel: general-purpose cores are 
very expensive to build. They require lots of chip area, a sophisticated support structure (memory 
interfaces, caching logic between cores, high-speed interconnects, and so on), and they are com- 
paratively bad at any single task. Modern laptops have up to 4 cores, and even high-end servers 
rarely exceed 64 cores, simply because it is not cost effective. 


By comparison, GPUs consist of 100 ~ 1000 small processing elements (the details differ some- 
what between NVIDIA, ATI, ARM and other chip vendors), often grouped into larger groups 
(NVIDIA calls them warps). While each core is relatively weak, sometimes even running at sub- 
1GHz clock frequency, it is the total number of such cores that makes GPUs orders of magnitude 
faster than CPUs. For instance, NVIDIA's recent Volta generation offers up to 120 TFlops per chip 
for specialized instructions (and up to 24 TFlops for more general-purpose ones), while floating 
point performance of CPUs has not exceeded 1 TFlop to date. The reason this is possible is
actually quite simple. First, power consumption tends to grow quadratically with clock frequency.
Hence, for the power budget of a CPU core that runs 4 times faster (a typical number), you can use 
16 GPU cores at 1/4 the speed, which yields 16 x 1/4 = 4 times the performance. Furthermore, 
GPU cores are much simpler (in fact, for a long time they were not even able to execute general- 
purpose code), which makes them more energy efficient. Last, many operations in deep learning 
require high memory bandwidth. Again, GPUs shine here with buses that are at least 10 times as 
wide as many CPUs. 
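The power argument is easy to check numerically. The sketch below assumes, as the text does, that power grows with the square of the clock frequency, and additionally that throughput is proportional to the number of cores times their frequency. Units are arbitrary and the variable names are ours.

```python
def power(freq):
    """Relative power draw of one core, assumed quadratic in clock frequency."""
    return freq ** 2

fast_freq, slow_freq = 4.0, 1.0          # the fast core runs 4 times faster
budget = power(fast_freq)                # power used by one fast core
n_slow = int(budget / power(slow_freq))  # slow cores that fit in that budget

fast_throughput = 1 * fast_freq
slow_throughput = n_slow * slow_freq
speedup = slow_throughput / fast_throughput
print(f'{n_slow} slow cores, {speedup:.0f}x the throughput of one fast core')
```

This reproduces the 16 x 1/4 = 4 figure: the same power budget buys 16 slow cores, which together deliver 4 times the throughput, provided the workload can actually be spread across them.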


Back to 2012. A major breakthrough came when Alex Krizhevsky and Ilya Sutskever implemented
a deep CNN that could run on GPU hardware. They realized that the computational bottlenecks
in CNNs, convolutions and matrix multiplications, are all operations that could be parallelized in
hardware. Using two NVIDIA GTX 580s with 3GB of memory, they implemented fast convolutions.
Their code, cuda-convnet (https://code.google.com/archive/p/cuda-convnet/), was good enough that
for several years it was the industry standard and powered the first couple of years of the deep
learning boom.


7.1.2 AlexNet


AlexNet, which employed an 8-layer CNN, won the ImageNet Large Scale Visual Recognition Chal- 
lenge 2012 by a phenomenally large margin. This network showed, for the first time, that the 
features obtained by learning can transcend manually-designed features, breaking the previous 
paradigm in computer vision. 


The architectures of AlexNet and LeNet are very similar, as Fig. 7.1.2 illustrates. Note that we 
provide a slightly streamlined version of AlexNet removing some of the design quirks that were 
needed in 2012 to make the model fit on two small GPUs. 


Fig. 7.1.2: From LeNet (left) to AlexNet (right).





The design philosophies of AlexNet and LeNet are very similar, but there are also significant
differences. First, AlexNet is much deeper than the comparatively small LeNet-5. AlexNet consists
of eight layers: five convolutional layers, two fully-connected hidden layers, and one
fully-connected output layer. Second, AlexNet used the ReLU instead of the sigmoid as its
activation function. Let us delve into the details below.


Architecture


In AlexNet's first layer, the convolution window shape is 11 x 11. Since most images in ImageNet 
are more than ten times higher and wider than the MNIST images, objects in ImageNet data tend 
to occupy more pixels. Consequently, a larger convolution window is needed to capture the object. 
The convolution window shape in the second layer is reduced to 5 x 5, followed by 3 x 3. In addition,
after the first, second, and fifth convolutional layers, the network adds maximum pooling layers 
with a window shape of 3 x 3 and a stride of 2. Moreover, AlexNet has ten times more convolution 
channels than LeNet. 


After the last convolutional layer there are two fully-connected layers with 4096 outputs. These 
two huge fully-connected layers produce model parameters of nearly 1 GB. Due to the limited 
memory in early GPUs, the original AlexNet used a dual data stream design, so that each of their 
two GPUs could be responsible for storing and computing only its half of the model. Fortunately, 
GPU memory is comparatively abundant now, so we rarely need to break up models across GPUs 
these days (our version of the AlexNet model deviates from the original paper in this aspect). 


Activation Functions 


AlexNet also replaced the sigmoid activation function with the simpler ReLU activation function.
On one hand, the computation of the ReLU activation function is simpler. For example, it does 
not have the exponentiation operation found in the sigmoid activation function. On the other 
hand, the ReLU activation function makes model training easier when using different parameter 
initialization methods. This is because, when the output of the sigmoid activation function is very 
close to 0 or 1, the gradient of these regions is almost 0, so that backpropagation cannot continue 
to update some of the model parameters. In contrast, the gradient of the ReLU activation function 
in the positive interval is always 1. Therefore, if the model parameters are not properly initialized, 
the sigmoid function may obtain a gradient of almost 0 in the positive interval, so that the model 
cannot be effectively trained. 
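The saturation argument can be checked with a few numbers. The following NumPy sketch evaluates the derivative of the sigmoid, sigmoid(x) * (1 - sigmoid(x)), and the derivative of the ReLU at a handful of inputs. The sample points are arbitrary choices of ours, and we take the ReLU gradient at exactly 0 to be 0 by convention.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1 - s)

def relu_grad(x):
    """Derivative of max(x, 0); set to 0 at x = 0 by convention."""
    return (x > 0).astype(float)

x = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])
sig_g = sigmoid_grad(x)
rel_g = relu_grad(x)
for xi, sg, rg in zip(x, sig_g, rel_g):
    print(f'x = {xi:6.1f}  sigmoid grad = {sg:.6f}  relu grad = {rg:.0f}')
```

For inputs of large magnitude the sigmoid gradient is nearly zero on both sides, whereas the ReLU gradient stays at 1 for every positive input.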


Capacity Control and Preprocessing 


AlexNet controls the model complexity of the fully-connected layer by dropout (Section 4.6), while 
LeNet only uses weight decay. To augment the data even further, the training loop of AlexNet 
added a great deal of image augmentation, such as flipping, clipping, and color changes. This 
makes the model more robust and the larger sample size effectively reduces overfitting. We will 
discuss data augmentation in greater detail in Section 13.1. 
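As a rough illustration of the kind of augmentation mentioned above, the following NumPy sketch applies a random horizontal flip, a random crop, and a random brightness change to an image array. The function name, the crop size, and the brightness range are our own choices for illustration and are not the settings used to train AlexNet.

```python
import numpy as np

def augment(img, crop, rng):
    """Randomly flip, crop, and rescale the brightness of an (H, W, C) image."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                # horizontal flip
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)      # random crop offsets
    left = rng.integers(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop, :]
    scale = rng.uniform(0.8, 1.2)            # brightness change
    return np.clip(img * scale, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))              # stand-in for a real photo
augmented = [augment(image, crop=28, rng=rng) for _ in range(4)]
print([a.shape for a in augmented])
```

Each call yields a slightly different view of the same image, so the model rarely sees exactly the same input twice during training.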


from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()

net = nn.Sequential()
# Here, we use a larger 11 x 11 window to capture objects. At the same time,
# we use a stride of 4 to greatly reduce the height and width of the output.
# Here, the number of output channels is much larger than that in LeNet
net.add(nn.Conv2D(96, kernel_size=11, strides=4, activation='relu'),
        nn.MaxPool2D(pool_size=3, strides=2),
        # Make the convolution window smaller, set padding to 2 for consistent
        # height and width across the input and output, and increase the
        # number of output channels
        nn.Conv2D(256, kernel_size=5, padding=2, activation='relu'),
        nn.MaxPool2D(pool_size=3, strides=2),
        # Use three successive convolutional layers and a smaller convolution
        # window. Except for the final convolutional layer, the number of
        # output channels is further increased. Pooling layers are not used to
        # reduce the height and width of input after the first two
        # convolutional layers
        nn.Conv2D(384, kernel_size=3, padding=1, activation='relu'),
        nn.Conv2D(384, kernel_size=3, padding=1, activation='relu'),
        nn.Conv2D(256, kernel_size=3, padding=1, activation='relu'),
        nn.MaxPool2D(pool_size=3, strides=2),
        # Here, the number of outputs of the fully-connected layer is several
        # times larger than that in LeNet. Use the dropout layer to mitigate
        # overfitting
        nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
        nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
        # Output layer. Since we are using Fashion-MNIST, the number of
        # classes is 10, instead of 1000 as in the paper
        nn.Dense(10))


We construct a single-channel data example with both height and width of 224 to observe the 
output shape of each layer. It matches the AlexNet architecture in Fig. 7.1.2. 


X = np.random.uniform(size=(1, 1, 224, 224))
net.initialize()
for layer in net:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)


conv0 output shape: (1, 96, 54, 54)
pool0 output shape: (1, 96, 26, 26)
conv1 output shape: (1, 256, 26, 26)
pool1 output shape: (1, 256, 12, 12)
conv2 output shape: (1, 384, 12, 12)
conv3 output shape: (1, 384, 12, 12)
conv4 output shape: (1, 256, 12, 12)
pool2 output shape: (1, 256, 5, 5)
dense0 output shape: (1, 4096)
dropout0 output shape: (1, 4096)
dense1 output shape: (1, 4096)
dropout1 output shape: (1, 4096)
dense2 output shape: (1, 10)


7.1.3 Reading the Dataset


Although AlexNet is trained on ImageNet in the paper, we use Fashion-MNIST here since training 
an ImageNet model to convergence could take hours or days even on a modern GPU. One of the 
problems with applying AlexNet directly on Fashion-MNIST is that its images have lower resolu- 
tion (28 x 28 pixels) than ImageNet images. To make things work, we upsample them to 224 x 224 
(generally not a smart practice, but we do it here to be faithful to the AlexNet architecture). We 
perform this resizing with the resize argument in the d2l.load_data_fashion_mnist function.
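For intuition about what upsampling from 28 x 28 to 224 x 224 involves, here is a minimal nearest-neighbor version in NumPy that repeats every pixel 8 times along each axis. It is only a stand-in of our own: the resize argument of d2l.load_data_fashion_mnist may well use a different interpolation scheme, and upsample_nearest is a hypothetical helper.

```python
import numpy as np

def upsample_nearest(img, factor):
    """Enlarge a 2D image by repeating each pixel `factor` times per axis."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

small = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)
large = upsample_nearest(small, 224 // 28)
print(small.shape, '->', large.shape)
```

The enlarged image carries no new information; it merely has 64 times as many pixels, which is part of why training on the upsampled data is so much more expensive.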


batch_size = 128 
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)


7.1.4 Training 


Now, we can start training AlexNet. Compared with LeNet in Section 6.6, the main change here is 
the use of a smaller learning rate and much slower training due to the deeper and wider network, 
the higher image resolution, and the more costly convolutions. 


lr, num_epochs = 0.01, 10 
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)


loss 0.336, train acc 0.878, test acc 0.882
4107.6 examples/sec on gpu(0)

[Figure: train loss, train acc, and test acc plotted against epoch.]





Summary 
• AlexNet has a similar structure to that of LeNet, but uses more convolutional layers and a
  larger parameter space to fit the large-scale ImageNet dataset.
• Today AlexNet has been surpassed by much more effective architectures, but it was a key step
  from shallow networks to the deep networks that are used nowadays.
• Although it seems that there are only a few more lines in AlexNet's implementation than in
  LeNet, it took the academic community many years to embrace this conceptual change and
  take advantage of its excellent experimental results. This was also due to the lack of efficient
  computational tools.
• Dropout, ReLU, and preprocessing were the other key steps in achieving excellent performance
  in computer vision tasks.


Exercises 
1. Try increasing the number of epochs. Compared with LeNet, how are the results different?
   Why?
2. AlexNet may be too complex for the Fashion-MNIST dataset.
   1. Try simplifying the model to make the training faster, while ensuring that the accuracy
      does not drop significantly.
   2. Design a better model that works directly on 28 x 28 images.
3. Modify the batch size, and observe the changes in accuracy and GPU memory.
4. Analyze computational performance of AlexNet.
   1. What is the dominant part for the memory footprint of AlexNet?
   2. What is the dominant part for computation in AlexNet?
   3. How about memory bandwidth when computing the results?
5. Apply dropout and ReLU to LeNet-5. Does it improve? How about preprocessing?


Discussions (https://discuss.d2l.ai/t/75)


7.2 Networks Using Blocks (VGG)


While AlexNet offered empirical evidence that deep CNNs can achieve good results, it did not
provide a general template to guide subsequent researchers in designing new networks. In the
following sections, we will introduce several heuristic concepts commonly used to design deep
networks.


Progress in this field mirrors that in chip design, where engineers went from placing transistors
to logical elements to logic blocks. Similarly, the design of neural network architectures has
grown progressively more abstract, with researchers moving from thinking in terms of individual
neurons to whole layers, and now to blocks, repeating patterns of layers.


The idea of using blocks first emerged from the Visual Geometry Group (VGG,
http://www.robots.ox.ac.uk/~vgg/) at Oxford University, in their eponymously-named VGG network.
It is easy to implement these repeated structures in code with any modern deep learning
framework by using loops and subroutines.


7.2.1 VGG Blocks


The basic building block of classic CNNs is a sequence of the following: (i) a convolutional layer 
with padding to maintain the resolution, (ii) a nonlinearity such as a ReLU, (iii) a pooling layer such 
as a max pooling layer. One VGG block consists of a sequence of convolutional layers, followed by 
a max pooling layer for spatial downsampling. In the original VGG paper (Simonyan & Zisserman, 
2014), the authors employed convolutions with 3 x 3 kernels with padding of 1 (keeping height and 
width) and 2 x 2 max pooling with stride of 2 (halving the resolution after each block). In the code 
below, we define a function called vgg_block to implement one VGG block. 


The function takes two arguments corresponding to the number of convolutional layers num_convs 
and the number of output channels num_channels. 


from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()


def vgg_block(num_convs, num_channels):
    blk = nn.Sequential()
    for _ in range(num_convs):
        blk.add(nn.Conv2D(num_channels, kernel_size=3,
                          padding=1, activation='relu'))
    blk.add(nn.MaxPool2D(pool_size=2, strides=2))
    return blk


7.2.2 VGG Network


Like AlexNet and LeNet, the VGG Network can be partitioned into two parts: the first consisting
mostly of convolutional and pooling layers and the second consisting of fully-connected layers.
This is depicted in Fig. 7.2.1.


Fig. 7.2.1: From AlexNet to VGG that is designed from building blocks.



The convolutional part of the network connects several VGG blocks from Fig. 7.2.1 (also defined 
in the vgg_block function) in succession. The following variable conv_arch consists of a list of 
tuples (one per block), where each contains two values: the number of convolutional layers and 
the number of output channels, which are precisely the arguments required to call the vgg_block 
function. The fully-connected part of the VGG network is identical to that covered in AlexNet. 


The original VGG network had 5 convolutional blocks, among which the first two have one convo- 
lutional layer each and the latter three contain two convolutional layers each. The first block has 
64 output channels and each subsequent block doubles the number of output channels, until that 
number reaches 512. Since this network uses 8 convolutional layers and 3 fully-connected layers, 
it is often called VGG-11. 


conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512)) 


The following code implements VGG-11. This is a simple matter of executing a for-loop over 
conv_arch. 


def vgg(conv_arch):
    net = nn.Sequential()
    # The convolutional part
    for (num_convs, num_channels) in conv_arch:
        net.add(vgg_block(num_convs, num_channels))
    # The fully-connected part
    net.add(nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
            nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
            nn.Dense(10))
    return net

net = vgg(conv_arch)


Next, we will construct a single-channel data example with a height and width of 224 to observe
the output shape of each layer.


net.initialize()
X = np.random.uniform(size=(1, 1, 224, 224))
for blk in net:
    X = blk(X)
    print(blk.name, 'output shape:\t', X.shape)


sequential1 output shape: (1, 64, 112, 112)
sequential2 output shape: (1, 128, 56, 56)
sequential3 output shape: (1, 256, 28, 28)
sequential4 output shape: (1, 512, 14, 14)
sequential5 output shape: (1, 512, 7, 7)
dense0 output shape: (1, 4096)
dropout0 output shape: (1, 4096)
dense1 output shape: (1, 4096)
dropout1 output shape: (1, 4096)
dense2 output shape: (1, 10)


As you can see, we halve height and width at each block, finally reaching a height and width of 7 
before flattening the representations for processing by the fully-connected part of the network. 
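Both the name VGG-11 and the final 7 x 7 resolution follow from simple arithmetic on conv_arch. The plain-Python sketch below counts the layers that carry weights and halves the input size once per block. The variable names other than conv_arch are ours.

```python
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

# Layers with weights: all convolutional layers plus the 3 dense layers
num_conv_layers = sum(num_convs for num_convs, _ in conv_arch)
num_dense_layers = 3
num_weight_layers = num_conv_layers + num_dense_layers
print('weight layers:', num_weight_layers)

# Each block ends with 2 x 2 max pooling of stride 2, halving the resolution
size = 224
for _ in conv_arch:
    size //= 2
last_channels = conv_arch[-1][1]
flattened = last_channels * size * size
print(f'final feature map: {last_channels} x {size} x {size}, '
      f'flattened length {flattened}')
```

The flattened length is the number of inputs that the first fully-connected layer receives for each example.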


7.2.3 Training 


Since VGG-11 is more computationally heavy than AlexNet, we construct a network with a smaller
number of channels. This is more than sufficient for training on Fashion-MNIST.


ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg(small_conv_arch)


Apart from using a slightly larger learning rate, the model training process is similar to that of 
AlexNet in Section 7.1. 


lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)


loss 0.174, train acc 0.935, test acc 0.924 
1799.7 examples/sec on gpu(0)


[Figure: training curves showing train loss, train acc, and test acc]





Summary 


* VGG-11 constructs a network using reusable convolutional blocks. Different VGG models can
be defined by the differences in the number of convolutional layers and output channels in
each block.


* The use of blocks leads to very compact representations of the network definition. It allows
for efficient design of complex networks.


* In their VGG paper, Simonyan and Zisserman experimented with various architectures. In
particular, they found that several layers of deep and narrow convolutions (i.e., 3 x 3) were
more effective than fewer layers of wider convolutions.


Exercises 


1. When printing out the dimensions of the layers, we only saw 8 results rather than 11. Where
did the information for the remaining 3 layers go?


2. Compared with AlexNet, VGG is much slower in terms of computation, and it also needs 
more GPU memory. Analyze the reasons for this. 


3. Try changing the height and width of the images in Fashion-MNIST from 224 to 96. What 
influence does this have on the experiments? 


4. Refer to Table 1 in the VGG paper (Simonyan & Zisserman, 2014) to construct other common 
models, such as VGG-16 or VGG-19. 


Discussions: https://discuss.d2l.ai/t/77







7.3 Network in Network (NiN) 


LeNet, AlexNet, and VGG all share a common design pattern: extract features exploiting spatial 
structure via a sequence of convolution and pooling layers and then post-process the representa- 
tions via fully-connected layers. The improvements upon LeNet by AlexNet and VGG mainly lie in 
how these later networks widen and deepen these two modules. Alternatively, one could imagine 
using fully-connected layers earlier in the process. However, a careless use of dense layers might
give up the spatial structure of the representation entirely. Network in network (NiN) blocks offer
an alternative. They were proposed based on a very simple insight: to use an MLP on the channels
for each pixel separately (Lin et al., 2013).


7.3.1 NiN Blocks 


Recall that the inputs and outputs of convolutional layers consist of four-dimensional tensors with 
axes corresponding to the example, channel, height, and width. Also recall that the inputs and 
outputs of fully-connected layers are typically two-dimensional tensors corresponding to the ex- 
ample and feature. The idea behind NiN is to apply a fully-connected layer at each pixel location 
(for each height and width). If we tie the weights across each spatial location, we could think of 
this as a 1 x 1 convolutional layer (as described in Section 6.4) or as a fully-connected layer acting 
independently on each pixel location. Another way to view this is to think of each element in the 
spatial dimension (height and width) as equivalent to an example and a channel as equivalent to 
a feature. 
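The equivalence between a 1 x 1 convolution and a per-pixel fully-connected layer can be checked on a tiny example. This is a sketch in plain Python with nested lists rather than tensors; the function names `conv1x1` and `per_pixel_dense`, the toy image, and the weights are all made up for illustration, and biases and activations are left out.

```python
def conv1x1(weights, image):
    """1 x 1 cross-correlation: weights[o][c], image[c][row][col]."""
    height, width = len(image[0]), len(image[0][0])
    return [[[sum(w_c * image[c][i][j] for c, w_c in enumerate(w_out))
              for j in range(width)]
             for i in range(height)]
            for w_out in weights]

def per_pixel_dense(weights, image):
    """Treat every pixel as an example whose features are its channels."""
    height, width = len(image[0]), len(image[0][0])
    out = [[[0] * width for _ in range(height)] for _ in weights]
    for i in range(height):
        for j in range(width):
            pixel = [channel[i][j] for channel in image]  # one "example"
            for o, w_out in enumerate(weights):
                out[o][i][j] = sum(w * f for w, f in zip(w_out, pixel))
    return out

image = [[[1, 2], [3, 4]],      # channel 0 of a 2 x 2 input
         [[5, 6], [7, 8]],      # channel 1
         [[9, 10], [11, 12]]]   # channel 2
weights = [[1, 0, -1],          # output channel 0
           [2, 1, 0]]           # output channel 1

conv_out = conv1x1(weights, image)
dense_out = per_pixel_dense(weights, image)
print(conv_out == dense_out)
```

Both functions share the same weights across all spatial locations, which is exactly the weight tying described above.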


Fig. 7.3.1 illustrates the main structural differences between VGG and NiN, and their blocks. The 
NiN block consists of one convolutional layer followed by two 1 x 1 convolutional layers that act 
as per-pixel fully-connected layers with ReLU activations. The convolution window shape of the 
first layer is typically set by the user. The subsequent window shapes are fixed to 1 x 1.


Fig. 7.3.1: Comparing architectures of VGG and NiN, and their blocks.


from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()


def nin_block(num_channels, kernel_size, strides, padding):
    blk = nn.Sequential()
    blk.add(nn.Conv2D(num_channels, kernel_size, strides, padding,
                      activation='relu'),
            nn.Conv2D(num_channels, kernel_size=1, activation='relu'),
            nn.Conv2D(num_channels, kernel_size=1, activation='relu'))
    return blk


7.3.2 NiN Model


The original NiN network was proposed shortly after AlexNet and clearly draws some inspiration
from it. NiN uses convolutional layers with window shapes of 11 x 11, 5 x 5, and 3 x 3, and the
corresponding numbers of output channels are the same as in AlexNet. Each NiN block is followed
by a maximum pooling layer with a stride of 2 and a window shape of 3 x 3.


One significant difference between NiN and AlexNet is that NiN avoids fully-connected layers al- 
together. Instead, NiN uses an NiN block with a number of output channels equal to the number 
of label classes, followed by a global average pooling layer, yielding a vector of logits. One ad- 
vantage of NiN’s design is that it significantly reduces the number of required model parameters. 
However, in practice, this design sometimes requires increased model training time. 
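To get a rough feel for the savings, we can count parameters by hand. The sketch below compares the final NiN block (384 input channels, 10 output channels, as in the model of this section) with a hypothetical 4096-unit dense layer applied to the same 384 x 5 x 5 representation. The dense layer is an assumption made purely for comparison, and the helper names are illustrative.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolutional layer."""
    return c_in * c_out * k * k + c_out

def dense_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out

# Final NiN block: one 3 x 3 convolution followed by two 1 x 1 convolutions
nin_head = conv_params(384, 10, 3) + 2 * conv_params(10, 10, 1)
# Hypothetical alternative: flatten 384 x 5 x 5 and feed a 4096-unit dense layer
dense_head = dense_params(384 * 5 * 5, 4096)

def global_avg_pool(feature_maps):
    """Average each channel over all spatial locations (no parameters)."""
    return [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

logits = global_avg_pool([[[1, 2], [3, 4]],
                          [[0, 0], [0, 8]]])
print(nin_head, dense_head, logits)
```

Global average pooling turns one feature map per class into one logit per class, so no dense layer is needed to reach the output dimension.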


net = nn.Sequential()
net.add(nin_block(96, kernel_size=11, strides=4, padding=0),
        nn.MaxPool2D(pool_size=3, strides=2),
        nin_block(256, kernel_size=5, strides=1, padding=2),
        nn.MaxPool2D(pool_size=3, strides=2),
        nin_block(384, kernel_size=3, strides=1, padding=1),
        nn.MaxPool2D(pool_size=3, strides=2),
        nn.Dropout(0.5),
        # There are 10 label classes
        nin_block(10, kernel_size=3, strides=1, padding=1),
        # The global average pooling layer automatically sets the window shape
        # to the height and width of the input
        nn.GlobalAvgPool2D(),
        # Transform the four-dimensional output into two-dimensional output
        # with a shape of (batch size, 10)
        nn.Flatten())


We create a data example to see the output shape of each block. 


X = np.random.uniform(size=(1, 1, 224, 224))
net.initialize()
for layer in net:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)


sequential1 output shape: (1, 96, 54, 54)
pool0 output shape: (1, 96, 26, 26)
sequential2 output shape: (1, 256, 26, 26)
pool1 output shape: (1, 256, 12, 12)
sequential3 output shape: (1, 384, 12, 12)
pool2 output shape: (1, 384, 5, 5)
dropout0 output shape: (1, 384, 5, 5)
sequential4 output shape: (1, 10, 5, 5)
pool3 output shape: (1, 10, 1, 1)
flatten0 output shape: (1, 10)


7.3.3 Training


As before, we use Fashion-MNIST to train the model. NiN’s training is similar to that for AlexNet
and VGG.


lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)


loss 0.356, train acc 0.869, test acc 0.868 
2963.0 examples/sec on gpu(0)


[Figure: training curves showing train loss, train acc, and test acc]




Summary 
* NiN uses blocks consisting of a convolutional layer and multiple 1 x 1 convolutional layers.
This can be used within the convolutional stack to allow for more per-pixel nonlinearity.


* NiN removes the fully-connected layers and replaces them with global average pooling (i.e.,
averaging over all locations) after reducing the number of channels to the desired number
of outputs (e.g., 10 for Fashion-MNIST).


* Removing the fully-connected layers reduces overfitting. NiN has dramatically fewer
parameters.


* The NiN design influenced many subsequent CNN designs.


Exercises 


1. Tune the hyperparameters to improve the classification accuracy. 


2. Why are there two 1 x 1 convolutional layers in the NiN block? Remove one of them, and 
then observe and analyze the experimental phenomena. 


3. Calculate the resource usage for NiN.
    1. What is the number of parameters?
    2. What is the amount of computation?
    3. What is the amount of memory needed during training?
    4. What is the amount of memory needed during prediction?


4. What are possible problems with reducing the 384 x 5 x 5 representation to a 10 x 5 x 5 
representation in one step? 


Discussions: https://discuss.d2l.ai/t/79


7.4 Networks with Parallel Concatenations (GoogLeNet) 


In 2014, GoogLeNet won the ImageNet Challenge, proposing a structure that combined the
strengths of NiN and paradigms of repeated blocks (Szegedy et al., 2015). One focus of the paper
was to address the question of which sizes of convolution kernels are best. After all, previous
popular networks employed choices as small as 1 x 1 and as large as 11 x 11. One insight in this
paper was that sometimes it can be advantageous to employ a combination of variously-sized
kernels. In this section, we will introduce GoogLeNet, presenting a slightly simplified version of
the original model: we omit a few ad-hoc features that were added to stabilize training but are
unnecessary now with better training algorithms available.


7.4.1 Inception Blocks 


The basic convolutional block in GoogLeNet is called an Inception block, likely named due to a quote 
from the movie Inception (“We need to go deeper”), which launched a viral meme. 





Fig. 7.4.1: Structure of the Inception block.


As depicted in Fig. 7.4.1, the Inception block consists of four parallel paths. The first three paths
use convolutional layers with window sizes of 1 x 1, 3 x 3, and 5 x 5 to extract information from
different spatial sizes. The middle two paths perform a 1 x 1 convolution on the input to reduce
the number of channels, reducing the model’s complexity. The fourth path uses a 3 x 3 maximum
pooling layer, followed by a 1 x 1 convolutional layer to change the number of channels. The
four paths all use appropriate padding to give the input and output the same height and width.
Finally, the outputs along each path are concatenated along the channel dimension and comprise
the block’s output. The commonly-tuned hyperparameters of the Inception block are the number
of output channels per layer.
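The benefit of the 1 x 1 reductions on the middle two paths can be estimated by counting weights. The sketch below assumes 192 input channels and uses the channel settings (96, 128) and (16, 32), which match the first Inception block that GoogLeNet uses in this section; the helper `conv_weights` is an illustrative name, and biases are ignored.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution, ignoring biases."""
    return c_in * c_out * k * k

c_in = 192  # assumed number of input channels

# Path 2: 3 x 3 convolution producing 128 channels
direct_3x3 = conv_weights(c_in, 128, 3)
reduced_3x3 = conv_weights(c_in, 96, 1) + conv_weights(96, 128, 3)

# Path 3: 5 x 5 convolution producing 32 channels
direct_5x5 = conv_weights(c_in, 32, 5)
reduced_5x5 = conv_weights(c_in, 16, 1) + conv_weights(16, 32, 5)

print('3 x 3 path:', direct_3x3, 'vs', reduced_3x3)
print('5 x 5 path:', direct_5x5, 'vs', reduced_5x5)
```

Under these assumptions the 5 x 5 path needs roughly a tenth of the weights once the 1 x 1 reduction is inserted, which is why wide kernels remain affordable inside the block.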





from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()


class Inception(nn.Block):
    # `c1`--`c4` are the number of output channels for each path
    def __init__(self, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # Path 1 is a single 1 x 1 convolutional layer
        self.p1_1 = nn.Conv2D(c1, kernel_size=1, activation='relu')
        # Path 2 is a 1 x 1 convolutional layer followed by a 3 x 3
        # convolutional layer
        self.p2_1 = nn.Conv2D(c2[0], kernel_size=1, activation='relu')
        self.p2_2 = nn.Conv2D(c2[1], kernel_size=3, padding=1,
                              activation='relu')
        # Path 3 is a 1 x 1 convolutional layer followed by a 5 x 5
        # convolutional layer
        self.p3_1 = nn.Conv2D(c3[0], kernel_size=1, activation='relu')
        self.p3_2 = nn.Conv2D(c3[1], kernel_size=5, padding=2,
                              activation='relu')
        # Path 4 is a 3 x 3 maximum pooling layer followed by a 1 x 1
        # convolutional layer
        self.p4_1 = nn.MaxPool2D(pool_size=3, strides=1, padding=1)
        self.p4_2 = nn.Conv2D(c4, kernel_size=1, activation='relu')

    def forward(self, x):
        p1 = self.p1_1(x)
        p2 = self.p2_2(self.p2_1(x))
        p3 = self.p3_2(self.p3_1(x))
        p4 = self.p4_2(self.p4_1(x))
        # Concatenate the outputs on the channel dimension
        return np.concatenate((p1, p2, p3, p4), axis=1)


To gain some intuition for why this network works so well, consider the combination of the filters.
They explore the image in a variety of filter sizes. This means that details at different scales can
be recognized efficiently by filters of different sizes. At the same time, we can allocate different
numbers of parameters to different filters.


7.4.2 GoogLeNet Model 


As shown in Fig. 7.4.2, GoogLeNet uses a stack of a total of 9 Inception blocks and global average
pooling to generate its estimates. Maximum pooling between Inception blocks reduces the
dimensionality. The first module is similar to AlexNet and LeNet. The stack of blocks is inherited
from VGG and the global average pooling avoids a stack of fully-connected layers at the end.


Fig. 7.4.2: The GoogLeNet architecture.





We can now implement GoogLeNet piece by piece. The first module uses a 64-channel 7 x 7 con- 
volutional layer. 


b1 = nn.Sequential()
b1.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3, activation='relu'),
       nn.MaxPool2D(pool_size=3, strides=2, padding=1))


The second module uses two convolutional layers: first, a 64-channel 1 x 1 convolutional layer, 
then a 3 x 3 convolutional layer that triples the number of channels. This corresponds to the 
second path in the Inception block. 


b2 = nn.Sequential()
b2.add(nn.Conv2D(64, kernel_size=1, activation='relu'),
       nn.Conv2D(192, kernel_size=3, padding=1, activation='relu'),
       nn.MaxPool2D(pool_size=3, strides=2, padding=1))


The third module connects two complete Inception blocks in series. The number of output
channels of the first Inception block is 64 + 128 + 32 + 32 = 256, and the ratio of output channels
among the four paths is 64 : 128 : 32 : 32 = 2 : 4 : 1 : 1. The second and third paths first
reduce the number of input channels to 96/192 = 1/2 and 16/192 = 1/12, respectively, and then
connect the second convolutional layer. The number of output channels of the second Inception
block is increased to 128 + 192 + 96 + 64 = 480, and the ratio of output channels among
the four paths is 128 : 192 : 96 : 64 = 4 : 6 : 3 : 2. The second and third paths first reduce the
number of input channels to 128/256 = 1/2 and 32/256 = 1/8, respectively.


b3 = nn.Sequential()
b3.add(Inception(64, (96, 128), (16, 32), 32),
       Inception(128, (128, 192), (32, 96), 64),
       nn.MaxPool2D(pool_size=3, strides=2, padding=1))


The fourth module is more complicated. It connects five Inception blocks in series, and they have 
192+208+48+64 = 512, 160+224+64+64 = 512, 128+256+64+64 = 512, 112+288+64+64 = 528, 
and 256+320+128+128 = 832 output channels, respectively. The number of channels assigned to 
these paths is similar to that in the third module: the second path with the 3 x 3 convolutional layer 
outputs the largest number of channels, followed by the first path with only the 1 x 1 convolutional 
layer, the third path with the 5 x 5 convolutional layer, and the fourth path with the 3 x 3 maximum 
pooling layer. The second and third paths will first reduce the number of channels according to 
the ratio. These ratios are slightly different in different Inception blocks. 


b4 = nn.Sequential()
b4.add(Inception(192, (96, 208), (16, 48), 64),
       Inception(160, (112, 224), (24, 64), 64),
       Inception(128, (128, 256), (24, 64), 64),
       Inception(112, (144, 288), (32, 64), 64),
       Inception(256, (160, 320), (32, 128), 128),
       nn.MaxPool2D(pool_size=3, strides=2, padding=1))


The fifth module has two Inception blocks with 256 + 320 + 128 + 128 = 832 and 384 + 384 + 128 +
128 = 1024 output channels. Channels are allocated to the paths in the same pattern as in the
third and fourth modules; only the specific values differ. Note that the fifth module is
followed by the output layer. It uses a global average pooling layer to change the height
and width of each channel to 1, just as in NiN. Finally, we turn the output into a two-dimensional
array followed by a fully-connected layer whose number of outputs is the number of label classes.


b5 = nn.Sequential()
b5.add(Inception(256, (160, 320), (32, 128), 128),
       Inception(384, (192, 384), (48, 128), 128),
       nn.GlobalAvgPool2D())


net = nn.Sequential()
net.add(b1, b2, b3, b4, b5, nn.Dense(10))
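The channel sums quoted for the third, fourth, and fifth modules can be verified mechanically. The sketch below repeats the constructor arguments from the code above in a plain dictionary; the helper `inception_out_channels` and the dictionary `modules` are illustrative names, not part of the model definition.

```python
def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenating the four paths.

    `c2` and `c3` are (reduction channels, output channels) pairs, so only
    their second entries reach the concatenation.
    """
    return c1 + c2[1] + c3[1] + c4

modules = {
    'b3': [(64, (96, 128), (16, 32), 32),
           (128, (128, 192), (32, 96), 64)],
    'b4': [(192, (96, 208), (16, 48), 64),
           (160, (112, 224), (24, 64), 64),
           (128, (128, 256), (24, 64), 64),
           (112, (144, 288), (32, 64), 64),
           (256, (160, 320), (32, 128), 128)],
    'b5': [(256, (160, 320), (32, 128), 128),
           (384, (192, 384), (48, 128), 128)],
}

totals = {name: [inception_out_channels(*cfg) for cfg in cfgs]
          for name, cfgs in modules.items()}
num_blocks = sum(len(cfgs) for cfgs in modules.values())
print(totals, num_blocks)
```

The output of each block is the input of the next one, so the totals also give the number of input channels that the following 1 x 1 reductions operate on.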


The GoogLeNet model is computationally complex, so it is not as easy to modify the number of 
channels as in VGG. To have a reasonable training time on Fashion-MNIST, we reduce the input 
height and width from 224 to 96. This simplifies the computation. The changes in the shape of the 
output between the various modules are demonstrated below. 


X = np.random.uniform(size=(1, 1, 96, 96))
net.initialize()
for layer in net:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)


sequential0 output shape: (1, 64, 24, 24)
sequential1 output shape: (1, 192, 12, 12)
sequential2 output shape: (1, 480, 6, 6)
sequential3 output shape: (1, 832, 3, 3)
sequential4 output shape: (1, 1024, 1, 1)
dense0 output shape: (1, 10)
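The heights and widths above follow from the usual output-size formula for convolution and pooling, floor((n + 2p - k) / s) + 1. The sketch below applies it to the strides and paddings used in modules 1 to 5, assuming that size-preserving layers (the 1 x 1 and padded 3 x 3 convolutions and the Inception blocks) can be skipped; `out_size` and `trace_googlenet` are hypothetical helper names.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output height or width of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

def trace_googlenet(n=96):
    """Height/width after each of the five modules."""
    sizes = []
    # Module 1: 7 x 7 convolution (stride 2, padding 3), then 3 x 3 max
    # pooling (stride 2, padding 1)
    n = out_size(out_size(n, 7, 2, 3), 3, 2, 1)
    sizes.append(n)
    # Modules 2 to 4: only the closing 3 x 3 max pooling changes the size
    for _ in range(3):
        n = out_size(n, 3, 2, 1)
        sizes.append(n)
    # Module 5: global average pooling reduces each channel to 1 x 1
    sizes.append(1)
    return sizes

print(trace_googlenet(96))
print(trace_googlenet(224))
```

Running the same trace for smaller inputs is one way to approach the exercise about the minimum image size.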


7.4.3 Training 


As before, we train our model using the Fashion-MNIST dataset. We transform it to 96 x 96 pixel 
resolution before invoking the training procedure. 


lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)


loss 0.259, train acc 0.902, test acc 0.884 
2307.5 examples/sec on gpu(0)


[Figure: training curves showing train loss, train acc, and test acc]




Summary 


* The Inception block is equivalent to a subnetwork with four paths. It extracts information
in parallel through convolutional layers of different window shapes and maximum pooling
layers. 1 x 1 convolutions reduce channel dimensionality on a per-pixel level. Maximum
pooling reduces the resolution.


* GoogLeNet connects multiple well-designed Inception blocks with other layers in series.
The ratio of the number of channels assigned in the Inception block is obtained through a
large number of experiments on the ImageNet dataset.


* GoogLeNet, as well as its succeeding versions, was one of the most efficient models on
ImageNet, providing similar test accuracy with lower computational complexity.


Exercises


1. There are several iterations of GoogLeNet. Try to implement and run them. Some of them
include the following:


* Add a batch normalization layer (Ioffe & Szegedy, 2015), as described later in Section
7.5.


* Make adjustments to the Inception block (Szegedy et al., 2016).
* Use label smoothing for model regularization (Szegedy et al., 2016).


* Combine it with residual connections (Szegedy et al., 2017), as described later in Section
7.6.


2. What is the minimum image size for GoogLeNet to work? 


3. Compare the model parameter sizes of AlexNet, VGG, and NiN with GoogLeNet. How do the 
latter two network architectures significantly reduce the model parameter size? 


Discussions: https://discuss.d2l.ai/t/81


7.5 Batch Normalization 


Training deep neural networks is difficult. And getting them to converge in a reasonable amount 
of time can be tricky. In this section, we describe batch normalization, a popular and effective 
technique that consistently accelerates the convergence of deep networks (Ioffe & Szegedy, 2015). 
Together with residual blocks—covered later in Section 7.6—batch normalization has made it pos- 
sible for practitioners to routinely train networks with over 100 layers. 


7.5.1 Training Deep Networks 


To motivate batch normalization, let us review a few practical challenges that arise when training 
machine learning models and neural networks in particular. 


First, choices regarding data preprocessing often make an enormous difference in the final re- 
sults. Recall our application of MLPs to predicting house prices (Section 4.10). Our first step when 
working with real data was to standardize our input features to each have a mean of zero and vari- 
ance of one. Intuitively, this standardization plays nicely with our optimizers because it puts the 
parameters a priori at a similar scale. 


Second, for a typical MLP or CNN, as we train, the variables (e.g., affine transformation outputs in 
MLP) in intermediate layers may take values with widely varying magnitudes: both along the lay- 
ers from the input to the output, across units in the same layer, and over time due to our updates to 
the model parameters. The inventors of batch normalization postulated informally that this drift 
in the distribution of such variables could hamper the convergence of the network. Intuitively, 
we might conjecture that if one layer has variable values that are 100 times that of another layer, 
this might necessitate compensatory adjustments in the learning rates. 


Third, deeper networks are complex and easily capable of overfitting. This means that
regularization becomes more critical.


Batch normalization is applied to individual layers (optionally, to all of them) and works as follows:
In each training iteration, we first normalize the inputs (of batch normalization) by subtracting
their mean and dividing by their standard deviation, where both are estimated based on the
statistics of the current minibatch. Next, we apply a scale coefficient and an offset. It is precisely
due to this normalization based on batch statistics that batch normalization derives its name.


Note that if we tried to apply batch normalization with minibatches of size 1, we would not be able 
to learn anything. That is because after subtracting the means, each hidden unit would take value 
0! As you might guess, since we are devoting a whole section to batch normalization, with large 
enough minibatches, the approach proves effective and stable. One takeaway here is that when 
applying batch normalization, the choice of batch size may be even more significant than without 
batch normalization. 


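Formally, denoting by $\mathbf{x} \in \mathcal{B}$ an input to batch normalization (BN) that is from a minibatch $\mathcal{B}$,
batch normalization transforms $\mathbf{x}$ according to the following expression:


$$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}. \qquad (7.5.1)$$


In (7.5.1), $\hat{\boldsymbol{\mu}}_\mathcal{B}$ is the sample mean and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ is the sample standard deviation of the minibatch $\mathcal{B}$.
After applying standardization, the resulting minibatch has zero mean and unit variance. Because
the choice of unit variance (vs. some other magic number) is an arbitrary choice, we commonly
include an elementwise scale parameter $\boldsymbol{\gamma}$ and shift parameter $\boldsymbol{\beta}$ that have the same shape as $\mathbf{x}$. Note
that $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are parameters that need to be learned jointly with the other model parameters.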


Consequently, the variable magnitudes for intermediate layers cannot diverge during training
because batch normalization actively centers and rescales them back to a given mean and size (via
$\hat{\boldsymbol{\mu}}_\mathcal{B}$ and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$). One piece of practitioner’s intuition or wisdom is that batch normalization seems to
allow for more aggressive learning rates.


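Formally, we calculate $\hat{\boldsymbol{\mu}}_\mathcal{B}$ and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ in (7.5.1) as follows:


$$\hat{\boldsymbol{\mu}}_\mathcal{B} = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x},$$

$$\hat{\boldsymbol{\sigma}}_\mathcal{B}^2 = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B})^2 + \epsilon. \qquad (7.5.2)$$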





Note that we add a small constant $\epsilon > 0$ to the variance estimate to ensure that we never attempt
division by zero, even in cases where the empirical variance estimate might vanish. The estimates
$\hat{\boldsymbol{\mu}}_\mathcal{B}$ and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ counteract the scaling issue by using noisy estimates of mean and variance. You might
think that this noisiness should be a problem. As it turns out, this is actually beneficial.
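To make (7.5.1) and (7.5.2) concrete, here is a minimal sketch in plain Python that normalizes a minibatch stored as a list of feature rows. The function name `batch_norm_1d`, the toy data, and the value `eps=1e-5` are assumptions made for illustration; this is not the implementation developed later in this section.

```python
import math

def batch_norm_1d(batch, gamma, beta, eps=1e-5):
    """Apply (7.5.1) with the statistics of (7.5.2) to a list of rows."""
    n = len(batch)
    columns = list(zip(*batch))  # one tuple of values per feature
    means = [sum(col) / n for col in columns]
    # eps is added to the variance estimate, as in (7.5.2)
    variances = [sum((v - m) ** 2 for v in col) / n + eps
                 for col, m in zip(columns, means)]
    return [[g * (v - m) / math.sqrt(var) + b
             for v, m, var, g, b in zip(row, means, variances, gamma, beta)]
            for row in batch]

gamma, beta = [2.0, 1.0], [0.5, 0.0]

# Two examples; the second feature is constant, so its variance vanishes
out = batch_norm_1d([[1.0, 10.0], [3.0, 10.0]], gamma, beta)
# A minibatch of size 1: every input equals its own mean
single = batch_norm_1d([[1.0, 10.0]], gamma, beta)

print(out)
print(single)
```

The constant feature shows why the small constant is needed, and the single-example minibatch shows that the output collapses to the shift parameter regardless of the input.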


This turns out to be a recurring theme in deep learning. For reasons that are not yet
well-characterized theoretically, various sources of noise in optimization often lead to faster training
and less overfitting: this variation appears to act as a form of regularization. In some preliminary
research, Teye et al. (2018) and Luo et al. (2018) relate the properties of batch normalization to
Bayesian priors and penalties, respectively. In particular, this sheds some light on the puzzle of
why batch normalization works best for moderate minibatch sizes in the 50 to 100 range.


Fixing a trained model, you might think that we would prefer using the entire dataset to estimate
the mean and variance. Once training is complete, why would we want the same image to be
classified differently, depending on the batch in which it happens to reside? During training, such
exact calculation is infeasible because the intermediate variables for all data examples change
every time we update our model. However, once the model is trained, we can calculate the means
and variances of each layer's variables based on the entire dataset. Indeed this is standard practice
for models employing batch normalization and thus batch normalization layers function
differently in training mode (normalizing by minibatch statistics) and in prediction mode (normalizing by
dataset statistics).


We are now ready to take a look at how batch normalization works in practice. 


7.5.2 Batch Normalization Layers 


Batch normalization implementations for fully-connected layers and convolutional layers are
slightly different. We discuss both cases below. Recall that one key difference between batch
normalization and other layers is that because batch normalization operates on a full minibatch
at a time, we cannot just ignore the batch dimension as we did before when introducing other
layers.


Fully-Connected Layers 


When applying batch normalization to fully-connected layers, the original paper inserts batch
normalization after the affine transformation and before the nonlinear activation function (later
applications may insert batch normalization right after activation functions) (Ioffe & Szegedy,
2015). Denoting the input to the fully-connected layer by $\mathbf{x}$, the affine transformation by $\mathbf{W}\mathbf{x} + \mathbf{b}$
(with the weight parameter $\mathbf{W}$ and the bias parameter $\mathbf{b}$), and the activation function by $\phi$, we
can express the computation of a batch-normalization-enabled, fully-connected layer output $\mathbf{h}$ as
follows:


$$\mathbf{h} = \phi(\mathrm{BN}(\mathbf{W}\mathbf{x} + \mathbf{b})). \qquad (7.5.3)$$


Recall that mean and variance are computed on the same minibatch on which the transformation 
is applied. 


Convolutional Layers 


Similarly, with convolutional layers, we can apply batch normalization after the convolution and 
before the nonlinear activation function. When the convolution has multiple output channels, we 
need to carry out batch normalization for each of the outputs of these channels, and each channel 
has its own scale and shift parameters, both of which are scalars. Assume that our minibatches 
contain m examples and that for each channel, the output of the convolution has height p and 
width $q$. For convolutional layers, we carry out each batch normalization over the $m \cdot p \cdot q$ elements
per output channel simultaneously. Thus, we collect the values over all spatial locations when
computing the mean and variance and consequently apply the same mean and variance within a
given channel to normalize the value at each spatial location.


Batch Normalization During Prediction


As we mentioned earlier, batch normalization typically behaves differently in training mode and
prediction mode. First, the noise in the sample mean and the sample variance arising from
estimating each on minibatches is no longer desirable once we have trained the model. Second,
we might not have the luxury of computing per-batch normalization statistics. For example, we
might need to apply our model to make one prediction at a time.


Typically, after training, we use the entire dataset to compute stable estimates of the variable
statistics and then fix them at prediction time. Consequently, batch normalization behaves
differently during training and at test time. Recall that dropout also exhibits this characteristic.
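One common way to obtain such fixed statistics without a separate pass over the dataset is to keep a moving average of the minibatch statistics during training. The sketch below illustrates the idea for a single scalar feature. The class `ScalarBatchNorm`, its method names, and the values `momentum=0.9` and `eps=1e-5` are assumptions chosen for this example, and the scale and shift parameters are omitted.

```python
import math

class ScalarBatchNorm:
    """Batch normalization of one feature with running statistics."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.moving_mean, self.moving_var = 0.0, 1.0

    def train_step(self, batch):
        # Training mode: normalize with the statistics of this minibatch
        mean = sum(batch) / len(batch)
        var = sum((v - mean) ** 2 for v in batch) / len(batch)
        # Fold the minibatch statistics into the moving averages
        m = self.momentum
        self.moving_mean = m * self.moving_mean + (1.0 - m) * mean
        self.moving_var = m * self.moving_var + (1.0 - m) * var
        return [(v - mean) / math.sqrt(var + self.eps) for v in batch]

    def predict(self, value):
        # Prediction mode: use the fixed moving averages, so a single
        # example can be normalized on its own
        return (value - self.moving_mean) / math.sqrt(self.moving_var + self.eps)

bn = ScalarBatchNorm()
for _ in range(200):
    bn.train_step([4.0, 6.0])  # every minibatch has mean 5 and variance 1

print(bn.moving_mean, bn.moving_var, bn.predict(6.0))
```

In prediction mode the result for a given input no longer depends on which other examples happen to share its batch.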


7.5.3 Implementation from Scratch 


Below, we implement a batch normalization layer with tensors from scratch. 


from d2l import mxnet as d2l
from mxnet import autograd, np, npx, init
from mxnet.gluon import nn
npx.set_np()


def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use `autograd` to determine whether the current mode is training mode or
    # prediction mode
    if not autograd.is_training():
        # If it is prediction mode, directly use the mean and variance
        # obtained by moving average
        X_hat = (X - moving_mean) / np.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully-connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(axis=0)
            var = ((X - mean) ** 2).mean(axis=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of `X`, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(axis=(0, 2, 3), keepdims=True)
            var = ((X - mean) ** 2).mean(axis=(0, 2, 3), keepdims=True)
        # In training mode, the current mean and variance are used for the
        # standardization
        X_hat = (X - mean) / np.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean, moving_var
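The convolutional branch of `batch_norm` averages over axes 0, 2, and 3, that is, over all examples and all spatial locations of one channel. The following sketch spells this out with nested Python lists instead of tensors; `channel_stats` and the toy input are hypothetical and serve only to show which elements enter each statistic.

```python
def channel_stats(X):
    """Per-channel (count, mean, variance) for X[example][channel][row][col]."""
    num_channels = len(X[0])
    stats = []
    for c in range(num_channels):
        # Gather the m * p * q values that belong to channel c
        values = [v for example in X for row in example[c] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((len(values), mean, var))
    return stats

# m = 2 examples, 2 channels, height p = 2, width q = 2
X = [[[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]],
     [[[5.0, 6.0], [7.0, 8.0]], [[2.0, 2.0], [2.0, 2.0]]]]

stats = channel_stats(X)
for c, (count, mean, var) in enumerate(stats):
    print(f'channel {c}: {count} elements, mean {mean}, variance {var}')
```

Each channel ends up with a single mean and a single variance, which is why the scale and shift parameters of a convolutional batch normalization layer are scalars per channel.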


We can now create a proper BatchNorm layer. Our layer will maintain proper parameters for scale
gamma and shift beta, both of which will be updated in the course of training. Additionally, our
layer will maintain moving averages of the means and variances for subsequent use during model
prediction.


Putting aside the algorithmic details, note the design pattern underlying our implementation of
the layer. Typically, we define the mathematics in a separate function, say batch_norm. We then
integrate this functionality into a custom layer, whose code mostly addresses bookkeeping
matters, such as moving data to the right device context, allocating and initializing any required
variables, keeping track of moving averages (here for mean and variance), and so on. This pattern
enables a clean separation of mathematics from boilerplate code. Also note that for the sake of
convenience we did not worry about automatically inferring the input shape here, thus we need
to specify the number of features throughout. Do not worry, the high-level batch normalization
APIs in the deep learning framework will take care of this for us and we will demonstrate that later.


class BatchNorm(nn.Block):
    # `num_features`: the number of outputs for a fully-connected layer
    # or the number of output channels for a convolutional layer. `num_dims`:
    # 2 for a fully-connected layer and 4 for a convolutional layer
    def __init__(self, num_features, num_dims, **kwargs):
        super().__init__(**kwargs)
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = self.params.get('gamma', shape=shape, init=init.One())
        self.beta = self.params.get('beta', shape=shape, init=init.Zero())
        # The variables that are not model parameters are initialized to 0 and 1
        self.moving_mean = np.zeros(shape)
        self.moving_var = np.ones(shape)

    def forward(self, X):
        # If `X` is not on the main memory, copy `moving_mean` and
        # `moving_var` to the device where `X` is located
        if self.moving_mean.ctx != X.ctx:
            self.moving_mean = self.moving_mean.copyto(X.ctx)
            self.moving_var = self.moving_var.copyto(X.ctx)
        # Save the updated `moving_mean` and `moving_var`
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma.data(), self.beta.data(), self.moving_mean,
            self.moving_var, eps=1e-12, momentum=0.9)
        return Y
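To see the design pattern in isolation, here is a minimal framework-free sketch. It is not the batch_norm function used above: it handles a single scalar feature, and the names batch_norm_1d and ToyBatchNorm, the training flag, and the default eps are assumptions made for illustration. The mathematics lives in one function, while the class only stores parameters and moving averages.

```python
import math


def batch_norm_1d(xs, gamma, beta, moving_mean, moving_var,
                  eps=1e-5, momentum=0.9, training=True):
    """Normalize one minibatch of scalars (hypothetical helper)."""
    if training:
        # Use the statistics of the current minibatch
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        # Update the moving averages for later use in prediction mode
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    else:
        # In prediction mode, rely on the stored moving averages
        mean, var = moving_mean, moving_var
    ys = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]
    return ys, moving_mean, moving_var


class ToyBatchNorm:
    """Bookkeeping only: holds the parameters and the moving averages."""

    def __init__(self):
        self.gamma, self.beta = 1.0, 0.0                # scale and shift
        self.moving_mean, self.moving_var = 0.0, 1.0    # not model parameters

    def __call__(self, xs, training=True):
        ys, self.moving_mean, self.moving_var = batch_norm_1d(
            xs, self.gamma, self.beta, self.moving_mean, self.moving_var,
            training=training)
        return ys


layer = ToyBatchNorm()
train_out = layer([1.0, 2.0, 3.0, 4.0])        # updates the moving averages
pred_out = layer([2.5], training=False)        # leaves them untouched
```

Calling the layer in training mode changes its moving averages, whereas prediction mode only reads them, which is why the two modes can produce different outputs for the same input.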


7.5.4 Applying Batch Normalization in LeNet 


To see how to apply BatchNorm in context, below we apply it to a traditional LeNet model (Section 
6.6). Recall that batch normalization is applied after the convolutional layers or fully-connected 
layers but before the corresponding activation functions. 


net = nn.Sequential()
net.add(nn.Conv2D(6, kernel_size=5),
        BatchNorm(6, num_dims=4),
        nn.Activation('sigmoid'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Conv2D(16, kernel_size=5),
        BatchNorm(16, num_dims=4),
        nn.Activation('sigmoid'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Dense(120),
        BatchNorm(120, num_dims=2),
        nn.Activation('sigmoid'),
        nn.Dense(84),
        BatchNorm(84, num_dims=2),
        nn.Activation('sigmoid'),
        nn.Dense(10))


As before, we will train our network on the Fashion-MNIST dataset. This code is virtually identical 
to that when we first trained LeNet (Section 6.6). The main difference is the considerably larger 
learning rate. 


lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)


loss 0.244, train acc 0.910, test acc 0.834 
18342.5 examples/sec on gpu(0) 


[Figure: train loss, train acc, and test acc plotted against epoch.]


Let us have a look at the scale parameter gamma and the shift parameter beta learned from the first 
batch normalization layer. 


net[1].gamma.data().reshape(-1,), net[1].beta.data().reshape(-1,) 


(array([2.4347134, 1.0665272, 2.5463505, 1.9349247, 2.4168515, 1.2601634], ctx=gpu(0)),
 array([ 1.6742698e+00,  4.3546245e-02, -2.8972890e+00,  2.1968324e-01,
        -1.4296900e-03, -6.1734778e-01], ctx=gpu(0)))


7.5.5 Concise Implementation


Instead of the BatchNorm class that we just defined ourselves, we can directly use the BatchNorm
class defined in the high-level APIs of the deep learning framework. The code looks virtually
identical to our application of the custom implementation above.


net = nn.Sequential()
net.add(nn.Conv2D(6, kernel_size=5),
        nn.BatchNorm(),
        nn.Activation('sigmoid'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Conv2D(16, kernel_size=5),
        nn.BatchNorm(),
        nn.Activation('sigmoid'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Dense(120),
        nn.BatchNorm(),
        nn.Activation('sigmoid'),
        nn.Dense(84),
        nn.BatchNorm(),
        nn.Activation('sigmoid'),
        nn.Dense(10))


Below, we use the same hyperparameters to train our model. Note that as usual, the high-level API 
variant runs much faster because its code has been compiled to C++ or CUDA while our custom 
implementation must be interpreted by Python. 


d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)


loss 0.250, train acc 0.909, test acc 0.849 
37957.7 examples/sec on gpu(0) 


[Figure: train loss, train acc, and test acc plotted against epoch.]


7.5.6 Controversy


Intuitively, batch normalization is thought to make the optimization landscape smoother.
However, we must be careful to distinguish between speculative intuitions and true explanations for
the phenomena that we observe when training deep models. Recall that we do not even know why
simpler deep neural networks (MLPs and conventional CNNs) generalize well in the first place.
Even with dropout and weight decay, they remain so flexible that their ability to generalize to
unseen data cannot be explained via conventional learning-theoretic generalization guarantees.


In the original paper proposing batch normalization, the authors, in addition to introducing a
powerful and useful tool, offered an explanation for why it works: by reducing internal covariate
shift. Presumably by internal covariate shift the authors meant something like the intuition
expressed above: the notion that the distribution of variable values changes over the course of
training. However, there were two problems with this explanation: i) This drift is very different
from covariate shift, rendering the name a misnomer. ii) The explanation offers an under-specified
intuition but leaves open the question of why precisely this technique works, which still lacks
a rigorous explanation. Throughout this book, we aim to convey the intuitions that practitioners
use to guide their development of deep neural networks. However, we believe that it is important
to separate these guiding intuitions from established scientific fact. Eventually, when you master
this material and start writing your own research papers you will want to delineate clearly
between technical claims and hunches.


Following the success of batch normalization, its explanation in terms of internal covariate shift
has repeatedly surfaced in debates in the technical literature and broader discourse about how to
present machine learning research. In a memorable speech given while accepting a Test of Time
Award at the 2017 NeurIPS conference, Ali Rahimi used internal covariate shift as a focal point in
an argument likening the modern practice of deep learning to alchemy. Subsequently, the example
was revisited in detail in a position paper outlining troubling trends in machine learning
(Lipton & Steinhardt, 2018). Other authors have proposed alternative explanations for the success
of batch normalization, some claiming that batch normalization's success comes despite exhibiting
behavior that is in some ways opposite to that claimed in the original paper (Santurkar et al.,
2018).


We note that the internal covariate shift is no more worthy of criticism than any of thousands of 
similarly vague claims made every year in the technical machine learning literature. Likely, its 
resonance as a focal point of these debates owes to its broad recognizability to the target audience. 
Batch normalization has proven an indispensable method, applied in nearly all deployed image 
classifiers, earning the paper that introduced the technique tens of thousands of citations. 


Summary 


* During model training, batch normalization continuously adjusts the intermediate output
of the neural network by utilizing the mean and standard deviation of the minibatch, so that
the values of the intermediate output in each layer throughout the neural network are more
stable.


* The batch normalization methods for fully-connected layers and convolutional layers are
slightly different.


* Like a dropout layer, batch normalization layers have different computation results in
training mode and prediction mode.


* Batch normalization has many beneficial side effects, primarily that of regularization. On
the other hand, the original motivation of reducing internal covariate shift seems not to be
a valid explanation.


Exercises


1. Can we remove the bias parameter from the fully-connected layer or the convolutional layer
before the batch normalization? Why?


2. Compare the learning rates for LeNet with and without batch normalization.


    1. Plot the increase in training and test accuracy.


    2. How large can you make the learning rate?


3. Do we need batch normalization in every layer? Experiment with it.


4. Can you replace dropout by batch normalization? How does the behavior change?


5. Fix the parameters beta and gamma, and observe and analyze the results.


6. Review the online documentation for BatchNorm from the high-level APIs to see the other
applications for batch normalization.


7. Research ideas: think of other normalization transforms that you can apply. Can you apply
the probability integral transform? How about a full rank covariance estimate?


Discussions: https://discuss.d2l.ai/t/83


7.6 Residual Networks (ResNet)


As we design increasingly deeper networks it becomes imperative to understand how adding layers
can increase the complexity and expressiveness of the network. Even more important is the
ability to design networks where adding layers makes networks strictly more expressive rather
than just different. To make some progress we need a bit of mathematics.


7.6.1 Function Classes


Consider F, the class of functions that a specific network architecture (together with learning
rates and other hyperparameter settings) can reach. That is, for all f ∈ F there exists some set of
parameters (e.g., weights and biases) that can be obtained through training on a suitable dataset.
Let us assume that f* is the “truth” function that we really would like to find. If it is in F, we are in
good shape but typically we will not be quite so lucky. Instead, we will try to find some f*_F which
is our best bet within F. For instance, given a dataset with features X and labels y, we might try
finding it by solving the following optimization problem:


$$f^*_\mathcal{F} := \mathop{\mathrm{argmin}}_f L(\mathbf{X}, \mathbf{y}, f) \text{ subject to } f \in \mathcal{F}. \tag{7.6.1}$$


It is only reasonable to assume that if we design a different and more powerful architecture F′ we
should arrive at a better outcome. In other words, we would expect that f*_F′ is “better” than f*_F.
However, if F ⊈ F′ there is no guarantee that this should even happen. In fact, f*_F′ might well
be worse. As illustrated by Fig. 7.6.1, for non-nested function classes, a larger function class does
not always move closer to the “truth” function f*. For instance, on the left of Fig. 7.6.1, though
F_3 is closer to f* than F_1, F_6 moves away and there is no guarantee that further increasing the
complexity can reduce the distance from f*. With nested function classes where F_1 ⊆ … ⊆ F_6
on the right of Fig. 7.6.1, we can avoid the aforementioned issue from the non-nested function
classes.








Fig. 7.6.1: For non-nested function classes (left), a larger (indicated by area) function class is
not guaranteed to get closer to the “truth” function (f*). This does not happen with nested
function classes (right).


Thus, only if larger function classes contain the smaller ones are we guaranteed that increasing 
them strictly increases the expressive power of the network. For deep neural networks, if we can 
train the newly-added layer into an identity function f(x) = x, the new model will be as effective 
as the original model. As the new model may get a better solution to fit the training dataset, the 
added layer might make it easier to reduce training errors. 
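The argument about nested function classes can be checked on a toy problem. In the sketch below, every name and number is an illustrative assumption: each "function" is just a slope w in f(x) = w * x, the "truth" has slope 2.0, and a function class is a small list of candidate slopes. Minimizing over a class that contains the smaller one can never increase the best achievable loss, whereas a larger class that does not contain it can do worse.

```python
def best_in_class(candidates, loss):
    """Return the candidate with the smallest loss (the best bet in the class)."""
    return min(candidates, key=loss)


# Toy data generated by the "truth" function f*(x) = 2 * x
xs = [1.0, 2.0, 3.0]
ys = [2.0 * x for x in xs]


def loss(w):
    """Squared error of the linear function f(x) = w * x on the toy data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))


F1 = [0.0, 1.0]                    # a small function class
F2_nested = F1 + [1.5]             # larger, and contains F1
F2_other = [-3.0, -2.0, -1.0]      # larger, but does not contain F1

loss_f1 = loss(best_in_class(F1, loss))
loss_nested = loss(best_in_class(F2_nested, loss))
loss_other = loss(best_in_class(F2_other, loss))

print(loss_f1, loss_nested, loss_other)
```

Here the nested class improves on F1 while the non-nested class, despite having more members, ends up further from the truth.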


This is the question that He et al. considered when working on very deep computer vision models 
(He et al., 2016a). At the heart of their proposed residual network (ResNet) is the idea that every 
additional layer should more easily contain the identity function as one of its elements. These 
considerations are rather profound but they led to a surprisingly simple solution, a residual block. 
With it, ResNet won the ImageNet Large Scale Visual Recognition Challenge in 2015. The design 
had a profound influence on how to build deep neural networks. 


7.6.2 Residual Blocks 


Let us focus on a local part of a neural network, as depicted in Fig. 7.6.2. Denote the input by x.
We assume that the desired underlying mapping we want to obtain by learning is f(x), to be used
as the input to the activation function on the top. On the left of Fig. 7.6.2, the portion within the
dotted-line box must directly learn the mapping f(x). On the right, the portion within the
dotted-line box needs to learn the residual mapping f(x) - x, which is how the residual block derives
its name. If the identity mapping f(x) = x is the desired underlying mapping, the residual mapping
is easier to learn: we only need to push the weights and biases of the upper weight layer (e.g.,
fully-connected layer and convolutional layer) within the dotted-line box to zero. The right figure
in Fig. 7.6.2 illustrates the residual block of ResNet, where the solid line carrying the layer input x
to the addition operator is called a residual connection (or shortcut connection). With residual
blocks, inputs can forward propagate faster through the residual connections across layers.





Fig. 7.6.2: A regular block (left) and a residual block (right).
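The claim that the identity mapping is easy to obtain in a residual block can be illustrated with scalars. The following is only a sketch under simplifying assumptions: the two "weight layers" are scalar affine maps with hypothetical parameters w1, b1, w2, b2, and we look at the value f(x) that is fed to the activation function on top.

```python
def relu(z):
    return max(0.0, z)


def residual_mapping(x, w1, b1, w2, b2):
    """g(x): two scalar weight layers with a ReLU in between."""
    return w2 * relu(w1 * x + b1) + b2


def residual_block_preactivation(x, w1, b1, w2, b2):
    """f(x) = x + g(x), the input to the activation function on top."""
    return x + residual_mapping(x, w1, b1, w2, b2)


# Pushing the weight and bias of the upper weight layer to zero makes
# g(x) = 0, so f(x) = x regardless of the lower layer's parameters.
inputs = [-2.0, 0.0, 3.5]
outputs = [residual_block_preactivation(x, w1=0.7, b1=-0.3, w2=0.0, b2=0.0)
           for x in inputs]
print(outputs)
```

A regular block would instead have to tune both layers so that their composition reproduces x, which is a much more delicate target.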


ResNet follows VGG's full 3 × 3 convolutional layer design. The residual block has two 3 × 3
convolutional layers with the same number of output channels. Each convolutional layer is followed
by a batch normalization layer and a ReLU activation function. Then, we skip these two convolution
operations and add the input directly before the final ReLU activation function. This kind of
design requires that the output of the two convolutional layers has to be of the same shape as the
input, so that they can be added together. If we want to change the number of channels, we need
to introduce an additional 1 × 1 convolutional layer to transform the input into the desired shape
for the addition operation. Let us have a look at the code below.


from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()


class Residual(nn.Block):  #@save
    """The Residual block of ResNet."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1, **kwargs):
        super().__init__(**kwargs)
        self.conv1 = nn.Conv2D(num_channels, kernel_size=3, padding=1,
                               strides=strides)
        self.conv2 = nn.Conv2D(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2D(num_channels, kernel_size=1,
                                   strides=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm()
        self.bn2 = nn.BatchNorm()

    def forward(self, X):
        Y = npx.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        return npx.relu(Y + X)


This code generates two types of networks: one where we add the input to the output before
applying the ReLU nonlinearity whenever use_1x1conv=False, and one where we adjust channels
and resolution by means of a 1 × 1 convolution before adding. Fig. 7.6.3 illustrates this:


Fig. 7.6.3: ResNet block with and without 1 × 1 convolution.


Now let us look at a situation where the input and output are of the same shape. 


blk = Residual(3)
blk.initialize()
X = np.random.uniform(size=(4, 3, 6, 6))
blk(X).shape


(4, 3, 6, 6) 


We also have the option to halve the output height and width while increasing the number of 
output channels. 


blk = Residual(6, use_1x1conv=True, strides=2) 
blk.initialize() 
blk(X).shape


(4, 6, 3, 3)
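The two shapes above follow from the usual output-size arithmetic for convolutions. The sketch below assumes symmetric padding and the standard formula floor((n + 2p - k) / s) + 1; the helper names are hypothetical and only track one spatial dimension. It computes the size produced by the main branch (two 3 × 3 convolutions) and by the shortcut, which must agree for the addition to work.

```python
def conv_output_size(n, kernel_size, padding=0, stride=1):
    """Output height (or width) of a convolution along one spatial axis."""
    return (n + 2 * padding - kernel_size) // stride + 1


def residual_branch_sizes(n, strides=1, use_1x1conv=False):
    """Spatial size of the main branch and of the shortcut branch."""
    main = conv_output_size(n, 3, padding=1, stride=strides)   # conv1
    main = conv_output_size(main, 3, padding=1, stride=1)      # conv2
    if use_1x1conv:
        shortcut = conv_output_size(n, 1, stride=strides)      # conv3
    else:
        shortcut = n                                           # input as is
    return main, shortcut


print(residual_branch_sizes(6))                                # sizes match
print(residual_branch_sizes(6, strides=2, use_1x1conv=True))   # both halved
print(residual_branch_sizes(6, strides=2, use_1x1conv=False))  # mismatch
```

The last call shows why a stride of 2 without the 1 × 1 convolution would fail: the main branch shrinks while the shortcut keeps the input size.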


7.6.3 ResNet Model 


The first two layers of ResNet are the same as those of the GoogLeNet we described before: the
7 × 7 convolutional layer with 64 output channels and a stride of 2 is followed by the 3 × 3 maximum
pooling layer with a stride of 2. The difference is the batch normalization layer added after each
convolutional layer in ResNet.


net = nn.Sequential()
net.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3),
        nn.BatchNorm(), nn.Activation('relu'),
        nn.MaxPool2D(pool_size=3, strides=2, padding=1))


GoogLeNet uses four modules made up of Inception blocks. However, ResNet uses four modules 
made up of residual blocks, each of which uses several residual blocks with the same number 
of output channels. The number of channels in the first module is the same as the number of 
input channels. Since a maximum pooling layer with a stride of 2 has already been used, it is not 
necessary to reduce the height and width. In the first residual block for each of the subsequent 
modules, the number of channels is doubled compared with that of the previous module, and the 
height and width are halved. 


Now, we implement this module. Note that special processing has been performed on the first 
module. 


def resnet_block(num_channels, num_residuals, first_block=False):
    blk = nn.Sequential()
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.add(Residual(num_channels))
    return blk


Then, we add all the modules to ResNet. Here, two residual blocks are used for each module. 


net.add(resnet_block(64, 2, first_block=True), 
resnet_block(128, 2), 
resnet_block(256, 2), 
resnet_block(512, 2)) 


Finally, just like GoogLeNet, we add a global average pooling layer, followed by the fully-connected 
layer output. 


net.add(nn.GlobalAvgPool2D(), nn.Dense(10))


There are 4 convolutional layers in each module (excluding the 1 × 1 convolutional layer). Together
with the first 7 × 7 convolutional layer and the final fully-connected layer, there are 18 layers in
total. Therefore, this model is commonly known as ResNet-18. By configuring different numbers
of channels and residual blocks in the module, we can create different ResNet models, such as
the deeper 152-layer ResNet-152. Although the main architecture of ResNet is similar to that of
GoogLeNet, ResNet's structure is simpler and easier to modify. All these factors have resulted in
the rapid and widespread use of ResNet. Fig. 7.6.4 depicts the full ResNet-18.


Fig. 7.6.4: The ResNet-18 architecture.
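The layer count behind the name can be reproduced with a few lines of arithmetic. The helper count_layers below is hypothetical. The block configurations for the deeper variants are our reading of Table 1 in the ResNet paper (He et al., 2016a) and should be checked against it; the 152-layer variant uses bottleneck blocks with three convolutions each rather than the two-convolution block implemented here.

```python
def count_layers(blocks_per_module, convs_per_block=2):
    """First 7x7 convolution + convolutions in all residual blocks + final
    fully-connected layer. The 1x1 shortcut convolutions are not counted."""
    convs_in_modules = sum(blocks_per_module) * convs_per_block
    return 1 + convs_in_modules + 1


resnet18 = count_layers([2, 2, 2, 2])                        # as built above
resnet34 = count_layers([3, 4, 6, 3])                        # assumed config
resnet152 = count_layers([3, 8, 36, 3], convs_per_block=3)   # assumed config
print(resnet18, resnet34, resnet152)
```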


Before training ResNet, let us observe how the input shape changes across different modules in
ResNet. As in all the previous architectures, the resolution decreases while the number of
channels increases up until the point where a global average pooling layer aggregates all features.


X = np.random.uniform(size=(1, 1, 224, 224))
net.initialize()
for layer in net:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)


conv5 output shape: (1, 64, 112, 112)
batchnorm4 output shape: (1, 64, 112, 112)
relu0 output shape: (1, 64, 112, 112)
pool0 output shape: (1, 64, 56, 56)
sequential1 output shape: (1, 64, 56, 56)
sequential2 output shape: (1, 128, 28, 28)
sequential3 output shape: (1, 256, 14, 14)
sequential4 output shape: (1, 512, 7, 7)
pool1 output shape: (1, 512, 1, 1)
dense0 output shape: (1, 10)
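These shapes can be derived by hand. The sketch below is a plain-Python trace under stated assumptions: it uses the standard output-size formula with the kernel sizes, paddings, and strides from the code above, tracks only channels and one spatial dimension, and does not run the network itself. The function names are hypothetical.

```python
def conv_out(n, k, p, s):
    """Output size along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1


def resnet18_shape_trace(size=224):
    """Return (stage, channels, height) for each stage of the model above."""
    trace = []
    h = conv_out(size, k=7, p=3, s=2)
    trace.append(("7x7 conv", 64, h))
    h = conv_out(h, k=3, p=1, s=2)
    trace.append(("max pool", 64, h))
    for i, channels in enumerate([64, 128, 256, 512]):
        if i > 0:
            # The first residual block of every later module uses stride 2
            h = conv_out(h, k=3, p=1, s=2)
        trace.append((f"module {i + 1}", channels, h))
    trace.append(("global avg pool", 512, 1))
    return trace


for stage, channels, h in resnet18_shape_trace(224):
    print(f"{stage}: ({channels}, {h}, {h})")
```

Passing size=96 instead gives the resolutions seen when the images are resized to 96 pixels for training.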


7.6.4 Training 


We train ResNet on the Fashion-MNIST dataset, just like before. 


lr, num_epochs, batch_size = 0.05, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)


loss 0.012, train acc 0.997, test acc 0.881 
4852.1 examples/sec on gpu(0) 


[Figure: train loss, train acc, and test acc plotted against epoch.]


Summary


* Nested function classes are desirable. Learning an additional layer in deep neural networks
as an identity function (though this is an extreme case) should be made easy.


* The residual mapping can learn the identity function more easily, for example by pushing
the parameters in the weight layer to zero.


* We can train an effective deep neural network by having residual blocks. Inputs can forward
propagate faster through the residual connections across layers.


* ResNet had a major influence on the design of subsequent deep neural networks, both
convolutional and sequential in nature.


Exercises 


1. What are the major differences between the Inception block in Fig. 7.4.1 and the residual 
block? After removing some paths in the Inception block, how are they related to each other? 


2. Refer to Table 1 in the ResNet paper (He et al., 2016a) to implement different variants. 


3. For deeper networks, ResNet introduces a “bottleneck” architecture to reduce model
complexity. Try to implement it.


4. In subsequent versions of ResNet, the authors changed the “convolution, batch normalization,
and activation” structure to the “batch normalization, activation, and convolution”
structure. Make this improvement yourself. See Figure 1 in (He et al., 2016b) for details.


5. Why can't we just increase the complexity of functions without bound, even if the function 
classes are nested? 


Discussions: https://discuss.d2l.ai/t/85


7.7 Densely Connected Networks (DenseNet)


ResNet significantly changed the view of how to parametrize the functions in deep networks.
DenseNet (dense convolutional network) is to some extent the logical extension of this (Huang
et al., 2017). To understand how to arrive at it, let us take a small detour to mathematics.


7.7.1 From ResNet to DenseNet


Recall the Taylor expansion for functions. For the point x = 0 it can be written as


$$f(x) = f(0) + f'(0) x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + \ldots. \tag{7.7.1}$$


The key point is that it decomposes a function into increasingly higher order terms. In a similar
vein, ResNet decomposes functions into


$$f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x}). \tag{7.7.2}$$


That is, ResNet decomposes f into a simple linear term and a more complex nonlinear one. What
if we want to capture (not necessarily add) information beyond two terms? One solution was
DenseNet (Huang et al., 2017).


Fig. 7.7.1: The main difference between ResNet (left) and DenseNet (right) in cross-layer
connections: use of addition and use of concatenation.


As shown in Fig. 7.7.1, the key difference between ResNet and DenseNet is that in the latter case
outputs are concatenated (denoted by [,]) rather than added. As a result, we perform a mapping
from x to its values after applying an increasingly complex sequence of functions:


$$\mathbf{x} \to \left[\mathbf{x}, f_1(\mathbf{x}), f_2([\mathbf{x}, f_1(\mathbf{x})]), f_3([\mathbf{x}, f_1(\mathbf{x}), f_2([\mathbf{x}, f_1(\mathbf{x})])]), \ldots\right]. \tag{7.7.3}$$


In the end, all these functions are combined in an MLP to reduce the number of features again. In
terms of implementation this is quite simple: rather than adding terms, we concatenate them.
The name DenseNet arises from the fact that the dependency graph between variables becomes
quite dense. The last layer of such a chain is densely connected to all previous layers. The dense
connections are shown in Fig. 7.7.2.





Fig. 7.7.2: Dense connections in DenseNet. 
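The mapping in (7.7.3) can be mimicked with plain Python lists, where list concatenation stands in for concatenation along the channel dimension. This is only a sketch: dense_chain and the toy functions f1 and f2 are hypothetical and have nothing to do with convolutions, but they show that each function sees everything computed before it and that the feature vector keeps growing.

```python
def dense_chain(x, functions):
    """Apply each function to all features so far and concatenate its output."""
    features = list(x)
    for f in functions:
        features = features + f(features)   # concatenation, not addition
    return features


def f1(features):
    """Toy function producing one new feature."""
    return [sum(features)]


def f2(features):
    """Toy function producing two new features."""
    return [max(features), min(features)]


out = dense_chain([1.0, 2.0], [f1, f2])
print(out)
```

The input survives unchanged at the front of the result, followed by the output of f1 applied to x, followed by the output of f2 applied to [x, f1(x)].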


The main components that compose a DenseNet are dense blocks and transition layers. The
former define how the inputs and outputs are concatenated, while the latter control the number of
channels so that it is not too large.


7.7.2 Dense Blocks 


DenseNet uses the modified “batch normalization, activation, and convolution” structure of 
ResNet (see the exercise in Section 7.6). First, we implement this convolution block structure. 


from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()


def conv_block(num_channels):
    blk = nn.Sequential()
    blk.add(nn.BatchNorm(),
            nn.Activation('relu'),
            nn.Conv2D(num_channels, kernel_size=3, padding=1))
    return blk


A dense block consists of multiple convolution blocks, each using the same number of output
channels. In the forward propagation, however, we concatenate the input and output of each
convolution block on the channel dimension.


class DenseBlock(nn.Block):
    def __init__(self, num_convs, num_channels, **kwargs):
        super().__init__(**kwargs)
        self.net = nn.Sequential()
        for _ in range(num_convs):
            self.net.add(conv_block(num_channels))

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate the input and output of each block on the channel
            # dimension
            X = np.concatenate((X, Y), axis=1)
        return X


In the following example, we define a DenseBlock instance with 2 convolution blocks of 10 output
channels. When using an input with 3 channels, we will get an output with 3 + 2 × 10 = 23 channels.
The number of convolution block channels controls the growth in the number of output channels
relative to the number of input channels. This is also referred to as the growth rate.


blk = DenseBlock(2, 10)
blk.initialize()
X = np.random.uniform(size=(4, 3, 8, 8))
Y = blk(X)
Y.shape


(4, 23, 8, 8)


7.7.3 Transition Layers 


Since each dense block will increase the number of channels, adding too many of them will lead
to an excessively complex model. A transition layer is used to control the complexity of the model.
It reduces the number of channels by using a 1 × 1 convolutional layer and halves the height
and width by using an average pooling layer with a stride of 2, further reducing the complexity of
the model.


def transition_block(num_channels):
    blk = nn.Sequential()
    blk.add(nn.BatchNorm(), nn.Activation('relu'),
            nn.Conv2D(num_channels, kernel_size=1),
            nn.AvgPool2D(pool_size=2, strides=2))
    return blk


Apply a transition layer with 10 channels to the output of the dense block in the previous example. 
This reduces the number of output channels to 10, and halves the height and width. 


blk = transition_block(10)
blk.initialize()
blk(Y).shape


(4, 10, 4, 4) 


7.7.4 DenseNet Model 


Next, we will construct a DenseNet model. DenseNet first uses the same single convolutional layer 
and maximum pooling layer as in ResNet. 


net = nn.Sequential()
net.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3),
        nn.BatchNorm(), nn.Activation('relu'),
        nn.MaxPool2D(pool_size=3, strides=2, padding=1))


Then, similar to the four modules made up of residual blocks that ResNet uses, DenseNet uses 
four dense blocks. Similar to ResNet, we can set the number of convolutional layers used in each 
dense block. Here, we set it to 4, consistent with the ResNet-18 model in Section 7.6. Furthermore, 
we set the number of channels (i.e., growth rate) for the convolutional layers in the dense block 
to 32, so 128 channels will be added to each dense block. 


In ResNet, the height and width are reduced between each module by a residual block with a 
stride of 2. Here, we use the transition layer to halve the height and width and halve the number 
of channels. 


# `num_channels`: the current number of channels
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]

for i, num_convs in enumerate(num_convs_in_dense_blocks):
    net.add(DenseBlock(num_convs, growth_rate))
    # This is the number of output channels in the previous dense block
    num_channels += num_convs * growth_rate
    # A transition layer that halves the number of channels is added between
    # the dense blocks
    if i != len(num_convs_in_dense_blocks) - 1:
        num_channels //= 2
        net.add(transition_block(num_channels))
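To make the channel bookkeeping in this loop explicit, the following sketch repeats the same arithmetic without building any layers. The function densenet_channel_trace is a hypothetical helper; its defaults mirror the settings used above (64 initial channels, growth rate 32, four convolution blocks per dense block).

```python
def densenet_channel_trace(num_channels=64, growth_rate=32,
                           num_convs_in_dense_blocks=(4, 4, 4, 4)):
    """Return the number of channels after each dense block and transition."""
    trace = []
    last = len(num_convs_in_dense_blocks) - 1
    for i, num_convs in enumerate(num_convs_in_dense_blocks):
        # Each convolution block appends `growth_rate` channels
        num_channels += num_convs * growth_rate
        trace.append(("dense block", num_channels))
        if i != last:
            # The transition layer halves the number of channels
            num_channels //= 2
            trace.append(("transition", num_channels))
    return trace


for name, channels in densenet_channel_trace():
    print(name, channels)
```

Each dense block adds 4 × 32 = 128 channels and each transition layer halves the total, so the final dense block outputs 248 channels under these settings.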





Similar to ResNet, a global pooling layer and a fully-connected layer are connected at the end to 
produce the output. 
net.add(nn.BatchNorm(),
        nn.Activation('relu'),
        nn.GlobalAvgPool2D(),
        nn.Dense(10))


7.7.5 Training 


Since we are using a deeper network here, in this section, we will reduce the input height and 
width from 224 to 96 to simplify the computation. 


lr, num_epochs, batch_size = 0.1, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)


loss 0.147, train acc 0.946, test acc 0.915
5569.8 examples/sec on gpu(0)


[Figure: train loss, train acc, and test acc plotted against epoch.]





Summary


* In terms of cross-layer connections, unlike ResNet, where inputs and outputs are added
together, DenseNet concatenates inputs and outputs on the channel dimension.


* The main components that compose DenseNet are dense blocks and transition layers.


* We need to keep the dimensionality under control when composing the network by adding
transition layers that shrink the number of channels again.


Exercises


1. Why do we use average pooling rather than maximum pooling in the transition layer? 


2. One of the advantages mentioned in the DenseNet paper is that it has fewer model parameters
than ResNet. Why is this the case?


3. One problem for which DenseNet has been criticized is its high memory consumption. 


1. Is this really the case? Try to change the input shape to 224 × 224 to see the actual GPU
memory consumption.


2. Can you think of an alternative means of reducing the memory consumption? How 
would you need to change the framework? 


4. Implement the various DenseNet versions presented in Table 1 of the DenseNet paper 
(Huang et al., 2017). 


5. Design an MLP-based model by applying the DenseNet idea. Apply it to the housing price 
prediction task in Section 4.10. 


Discussions: https://discuss.d2l.ai/t/87


8 Recurrent Neural Networks


So far we have encountered two types of data: tabular data and image data. For the latter we
designed specialized layers to take advantage of the regularity in them. In other words, if we were
to permute the pixels in an image, it would be much more difficult to reason about its content,
since the result would look much like the background of a test pattern in the times of analog TV.


Most importantly, so far we tacitly assumed that our data are all drawn from some distribution, 
and all the examples are independently and identically distributed (i.i.d.). Unfortunately, this is 
not true for most data. For instance, the words in this paragraph are written in sequence, and it 
would be quite difficult to decipher its meaning if they were permuted randomly. Likewise, image 
frames in a video, the audio signal in a conversation, and the browsing behavior on a website, all 
follow sequential order. It is thus reasonable to assume that specialized models for such data will 
do better at describing them. 


Another issue arises from the fact that we might not only receive a sequence as an input but rather 
might be expected to continue the sequence. For instance, the task could be to continue the series 
2,4,6,8,10,... This is quite common in time series analysis, to predict the stock market, the fever 
curve of a patient, or the acceleration needed for a race car. Again we want to have models that 
can handle such data. 


In short, while CNNs can efficiently process spatial information, recurrent neural networks (RNNs) 
are designed to better handle sequential information. RNNs introduce state variables to store past 
information, together with the current inputs, to determine the current outputs. 
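As a first taste of what a state variable does, consider the scalar recurrence below. It is a deliberately minimal sketch and not the model developed later in this chapter: the function run_rnn, the weights w_x and w_h, and the choice of tanh are all assumptions made for illustration. The state h summarizes the past and is combined with the current input at every step.

```python
import math


def run_rnn(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Scalar recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = h0
    states = []
    for x in inputs:
        # The new state depends on the current input and the previous state
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


# The same two inputs in a different order lead to different final states,
# so the model is sensitive to the sequence, not just to its elements.
print(run_rnn([1.0, 2.0])[-1], run_rnn([2.0, 1.0])[-1])
```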


Many of the examples for using recurrent networks are based on text data. Hence, we will
emphasize language models in this chapter. After a more formal review of sequence data we introduce
practical techniques for preprocessing text data. Next, we discuss basic concepts of a language
model and use this discussion as the inspiration for the design of RNNs. In the end, we describe
the gradient calculation method for RNNs to explore problems that may be encountered when
training such networks.


8.1 Sequence Models 


Imagine that you are watching movies on Netflix. As a good Netflix user, you decide to rate each 
of the movies religiously. After all, a good movie is a good movie, and you want to watch more of 
them, right? As it turns out, things are not quite so simple. People's opinions on movies can change
quite significantly over time. In fact, psychologists even have names for some of the effects: 


• There is anchoring, based on someone else's opinion. For instance, after the Oscar awards,
  ratings for the corresponding movie go up, even though it is still the same movie. This effect
  persists for a few months until the award is forgotten. It has been shown that the effect lifts
  ratings by over half a point (Wu et al., 2017).


• There is hedonic adaptation, where humans quickly adapt to accept an improved or a
  worsened situation as the new normal. For instance, after watching many good movies, the
  expectations that the next movie is equally good or better are high. Hence, even an average
  movie might be considered bad after many great ones are watched.


• There is seasonality. Very few viewers like to watch a Santa Claus movie in August.


• In some cases, movies become unpopular due to the misbehavior of directors or actors in
  the production.


• Some movies become cult movies because they were almost comically bad. Plan 9 from
  Outer Space and Troll 2 achieved a high degree of notoriety for this reason.


In short, movie ratings are anything but stationary. Thus, using temporal dynamics led to more
accurate movie recommendations (Koren, 2009). Of course, sequence data are not just about movie
ratings. The following gives more illustrations.


• Many users have highly particular behavior when it comes to the time when they open apps.
  For instance, social media apps are much more popular after school with students. Stock
  market trading apps are more commonly used when the markets are open.


• It is much harder to predict tomorrow's stock prices than to fill in the blanks for a stock price
  we missed yesterday, even though both are just a matter of estimating one number. After
  all, foresight is so much harder than hindsight. In statistics, the former (predicting beyond
  the known observations) is called extrapolation whereas the latter (estimating between the
  existing observations) is called interpolation.


• Music, speech, text, and videos are all sequential in nature. If we were to permute them they
  would make little sense. The headline dog bites man is much less surprising than man bites
  dog, even though the words are identical.


• Earthquakes are strongly correlated, i.e., after a massive earthquake there are very likely
  several smaller aftershocks, much more so than without the strong quake. In fact, earthquakes
  are spatiotemporally correlated, i.e., the aftershocks typically occur within a short
  time span and in close proximity.


• Humans interact with each other in a sequential nature, as can be seen in Twitter fights,
  dance patterns, and debates.


8.1.1 Statistical Tools


We need statistical tools and new deep neural network architectures to deal with sequence data. To 
keep things simple, we use the stock price (FTSE 100 index) illustrated in Fig. 8.1.1 as an example. 





[Figure: FTSE 100 Index; the horizontal axis spans 1984 to 2014.]


Fig. 8.1.1: FTSE 100 index over about 30 years.


Let us denote the prices by $x_t$, i.e., at time step $t \in \mathbb{Z}^+$ we observe price $x_t$. Note that for sequences
in this text, $t$ will typically be discrete and vary over the integers or a subset of them. Suppose that a trader
who wants to do well in the stock market on day $t$ predicts $x_t$ via


$$x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1). \tag{8.1.1}$$


Autoregressive Models 


In order to achieve this, our trader could use a regression model such as the one that we trained in
Section 3.3. There is just one major problem: the number of inputs, $x_{t-1}, \ldots, x_1$, varies depending
on $t$. That is to say, the number increases with the amount of data that we encounter, and we
will need an approximation to make this computationally tractable. Much of what follows in this
chapter will revolve around how to estimate $P(x_t \mid x_{t-1}, \ldots, x_1)$ efficiently. In a nutshell it boils
down to two strategies as follows.


First, assume that the potentially rather long sequence $x_{t-1}, \ldots, x_1$ is not really necessary. In
this case we might content ourselves with some timespan of length $\tau$ and only use the observations
$x_{t-1}, \ldots, x_{t-\tau}$. The immediate benefit is that now the number of arguments is always the same, at
least for $t > \tau$. This allows us to train a deep network as indicated above. Such models will be
called autoregressive models, as they quite literally perform regression on themselves.


The second strategy, shown in Fig. 8.1.2, is to keep some summary $h_t$ of the past observations,
and at the same time update $h_t$ in addition to the prediction $\hat{x}_t$. This leads to models that estimate
$x_t$ with $\hat{x}_t = P(x_t \mid h_t)$ and moreover updates of the form $h_t = g(h_{t-1}, x_{t-1})$. Since $h_t$ is never
observed, these models are also called latent autoregressive models.


Fig. 8.1.2: A latent autoregressive model.
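

The update rule $h_t = g(h_{t-1}, x_{t-1})$ can be made concrete with a deliberately simple choice of $g$.
The sketch below is an illustration under our own assumptions, not the model developed later in this
chapter: it uses an exponential moving average as the summary and predicts $\hat{x}_t = h_t$. The names
`latent_predict`, `decay`, and `h0` are hypothetical.

```python
def latent_predict(xs, decay=0.9, h0=0.0):
    """Predict each x_t from a running summary h_t of earlier observations.

    Assumed, hand-picked update (an exponential moving average):
        h_t = decay * h_{t-1} + (1 - decay) * x_{t-1}
    The prediction is simply x_hat_t = h_t.
    """
    h = h0
    preds = []
    for x in xs:
        preds.append(h)  # the prediction for x uses only earlier observations
        h = decay * h + (1 - decay) * x  # update the summary after seeing x
    return preds


# The state is a single number no matter how long the sequence grows
print(latent_predict([1.0, 1.0, 1.0, 1.0], decay=0.5))
```

Unlike the fixed window of length $\tau$, the summary here has constant size yet depends on the entire
past. RNNs replace the hand-picked $g$ with a learned one.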


Both cases raise the obvious question of how to generate training data. One typically uses historical
observations to predict the next observation given the ones up to right now. Obviously we do
not expect time to stand still. However, a common assumption is that while the specific values of
$x_t$ might change, at least the dynamics of the sequence itself will not. This is reasonable, since
novel dynamics are just that, novel and thus not predictable using data that we have so far.
Statisticians call dynamics that do not change stationary. Regardless of what we do, we will thus get an
estimate of the entire sequence via


$$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_{t-1}, \ldots, x_1). \tag{8.1.2}$$


Note that the above considerations still hold if we deal with discrete objects, such as words, rather
than continuous numbers. The only difference is that in such a situation we need to use a classifier
rather than a regression model to estimate $P(x_t \mid x_{t-1}, \ldots, x_1)$.

Markov Models 

Recall the approximation that in an autoregressive model we use only $x_{t-1}, \ldots, x_{t-\tau}$ instead of
$x_{t-1}, \ldots, x_1$ to estimate $x_t$. Whenever this approximation is accurate we say that the sequence
satisfies a Markov condition. In particular, if $\tau = 1$, we have a first-order Markov model and $P(x)$ is
given by


$$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_{t-1}) \text{ where } P(x_1 \mid x_0) = P(x_1). \tag{8.1.3}$$


Such models are particularly nice whenever $x_t$ assumes only discrete values, since in this case
dynamic programming can be used to compute values along the chain exactly. For instance, we
can compute $P(x_{t+1} \mid x_{t-1})$ efficiently:


$$\begin{aligned}
P(x_{t+1} \mid x_{t-1})
&= \frac{\sum_{x_t} P(x_{t+1}, x_t, x_{t-1})}{P(x_{t-1})} \\
&= \frac{\sum_{x_t} P(x_{t+1} \mid x_t, x_{t-1}) P(x_t, x_{t-1})}{P(x_{t-1})} \\
&= \sum_{x_t} P(x_{t+1} \mid x_t) P(x_t \mid x_{t-1})
\end{aligned} \tag{8.1.4}$$


by using the fact that we only need to take into account a very short history of past observations:
$P(x_{t+1} \mid x_t, x_{t-1}) = P(x_{t+1} \mid x_t)$. Going into details of dynamic programming is beyond the scope
of this section. Control and reinforcement learning algorithms use such tools extensively.
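

For a discrete first-order Markov chain, the sum over $x_t$ in (8.1.4) is just a matrix product. The
following is a minimal sketch in plain Python, assuming a made-up three-state transition matrix `A`
with `A[i][j]` $= P(x_{t+1} = j \mid x_t = i)$; the numbers and the helper name `two_step` are
illustrative only.

```python
# Hypothetical transition matrix: A[i][j] = P(x_{t+1} = j | x_t = i)
A = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.0, 0.3, 0.7]]


def two_step(A):
    """Return B with B[i][k] = sum_j A[i][j] * A[j][k].

    B[i][k] is P(x_{t+1} = k | x_{t-1} = i): the intermediate state x_t = j
    is summed out, exactly as in the last line of (8.1.4).
    """
    n = len(A)
    return [[sum(A[i][j] * A[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]


A2 = two_step(A)
for row in A2:
    # Each row is again a probability distribution over the next state
    print([round(p, 3) for p in row], 'row sum:', round(sum(row), 3))
```

Applying `two_step` repeatedly gives predictions further ahead at a cost that grows with the number of
states rather than with the number of possible histories.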
Causality


In principle, there is nothing wrong with unfolding $P(x_1, \ldots, x_T)$ in reverse order. After all, by
conditioning we can always write it via


$$P(x_1, \ldots, x_T) = \prod_{t=T}^1 P(x_t \mid x_{t+1}, \ldots, x_T). \tag{8.1.5}$$


In fact, if we have a Markov model, we can obtain a reverse conditional probability distribution,
too. In many cases, however, there exists a natural direction for the data, namely going forward
in time. It is clear that future events cannot influence the past. Hence, if we change $x_t$, we may be
able to influence what happens for $x_{t+1}$ going forward but not the converse. That is, if we change
$x_t$, the distribution over past events will not change. Consequently, it ought to be easier to explain
$P(x_{t+1} \mid x_t)$ rather than $P(x_t \mid x_{t+1})$. For instance, it has been shown that in some cases we can
find $x_{t+1} = f(x_t) + \epsilon$ for some additive noise $\epsilon$, whereas the converse is not true (Hoyer et al., 2009).
This is great news, since it is typically the forward direction that we are interested in estimating.
The book by Peters et al. explains more on this topic (Peters et al., 2017a). We are barely
scratching the surface of it.


8.1.2 Training 


After reviewing so many statistical tools, let us try this out in practice. We begin by generating 
some data. To keep things simple we generate our sequence data by using a sine function with 
some additive noise for time steps 1, 2,..., 1000. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, np, npx, gluon, init
from mxnet.gluon import nn

npx.set_np()


T = 1000  # Generate a total of 1000 points
time = np.arange(1, T + 1, dtype=np.float32)
x = np.sin(0.01 * time) + np.random.normal(0, 0.2, (T,))
d2l.plot(time, [x], 'time', 'x', xlim=[1, 1000], figsize=(6, 3))


Next, we need to turn such a sequence into features and labels that our model can train on. Based
on the embedding dimension $\tau$ we map the data into pairs $y_t = x_t$ and $\mathbf{x}_t = [x_{t-\tau}, \ldots, x_{t-1}]$. The
astute reader might have noticed that this gives us $\tau$ fewer data examples, since we do not have
sufficient history for the first $\tau$ of them. A simple fix, in particular if the sequence is long, is to
discard those few terms. Alternatively we could pad the sequence with zeros. Here we only use
the first 600 feature-label pairs for training.


tau = 4
features = np.zeros((T - tau, tau))
# Column `i` holds the observations shifted by `i` time steps
for i in range(tau):
    features[:, i] = x[i:T - tau + i]
labels = x[tau:].reshape((-1, 1))


batch_size, n_train = 16, 600
# Only the first `n_train` examples are used for training
train_iter = d2l.load_array((features[:n_train], labels[:n_train]),
                            batch_size, is_train=True)


Here we keep the architecture fairly simple: just an MLP with two fully-connected layers, ReLU 
activation, and squared loss. 


# A simple MLP
def get_net():
    net = nn.Sequential()
    net.add(nn.Dense(10, activation='relu'),
            nn.Dense(1))
    net.initialize(init.Xavier())
    return net


# Square loss
loss = gluon.loss.L2Loss()


Now we are ready to train the model. The code below is essentially identical to the training loop 
in previous sections, such as Section 3.3. Thus, we will not delve into much detail. 


def train(net, train_iter, loss, epochs, lr):
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': lr})
    for epoch in range(epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(batch_size)
        print(f'epoch {epoch + 1}, '
              f'loss: {d2l.evaluate_loss(net, train_iter, loss):f}')


net = get_net()
train(net, train_iter, loss, 5, 0.01)


epoch 1, loss: 0.037650
epoch 2, loss: 0.031607
epoch 3, loss: 0.028694
epoch 4, loss: 0.027035
epoch 5, loss: 0.026642


8.1.3 Prediction 


Since the training loss is small, we would expect our model to work well. Let us see what this 
means in practice. The first thing to check is how well the model is able to predict what happens 
just in the next time step, namely the one-step-ahead prediction. 


onestep_preds = net(features)
d2l.plot([time, time[tau:]],
         [x.asnumpy(), onestep_preds.asnumpy()],
         'time',
         'x',
         legend=['data', '1-step preds'],
         xlim=[1, 1000],
         figsize=(6, 3))


[Figure: data and 1-step predictions plotted against time.]


The one-step-ahead predictions look nice, just as we expected. Even beyond 604 (n_train + tau)
observations the predictions still look trustworthy. However, there is just one little problem:
if we observe sequence data only until time step 604, we cannot hope to receive the inputs
for all the future one-step-ahead predictions. Instead, we need to work our way forward one step
at a time:


$$\begin{aligned}
\hat{x}_{605} &= f(x_{601}, x_{602}, x_{603}, x_{604}), \\
\hat{x}_{606} &= f(x_{602}, x_{603}, x_{604}, \hat{x}_{605}), \\
\hat{x}_{607} &= f(x_{603}, x_{604}, \hat{x}_{605}, \hat{x}_{606}), \\
\hat{x}_{608} &= f(x_{604}, \hat{x}_{605}, \hat{x}_{606}, \hat{x}_{607}), \\
\hat{x}_{609} &= f(\hat{x}_{605}, \hat{x}_{606}, \hat{x}_{607}, \hat{x}_{608}), \\
&\ldots
\end{aligned} \tag{8.1.6}$$


Generally, for an observed sequence up to $x_t$, its predicted output $\hat{x}_{t+k}$ at time step $t+k$ is called
the $k$-step-ahead prediction. Since we have observed up to $x_{604}$, its $k$-step-ahead prediction
is $\hat{x}_{604+k}$. In other words, we will have to use our own predictions to make multistep-ahead
predictions. Let us see how well this goes.


multistep_preds = np.zeros(T)
multistep_preds[:n_train + tau] = x[:n_train + tau]
# Beyond the observed data, feed the model its own previous predictions
for i in range(n_train + tau, T):
    multistep_preds[i] = net(multistep_preds[i - tau:i].reshape((1, -1)))


d2l.plot([time, time[tau:], time[n_train + tau:]], [
    x.asnumpy(),
    onestep_preds.asnumpy(), multistep_preds[n_train + tau:].asnumpy()
],
         'time',
         'x',
         legend=['data', '1-step preds', 'multistep preds'],
         xlim=[1, 1000],
         figsize=(6, 3))


[Figure: data, 1-step predictions, and multistep predictions plotted against time.]


As the above example shows, this is a spectacular failure. The predictions decay to a constant
pretty quickly after a few prediction steps. Why did the algorithm work so poorly? This is
ultimately due to the fact that the errors build up. Let us say that after step 1 we have some error
$\epsilon_1 = \bar\epsilon$. Now the input for step 2 is perturbed by $\epsilon_1$, hence we suffer some error in the order of
$\epsilon_2 = \bar\epsilon + c \epsilon_1$ for some constant $c$, and so on. The error can diverge rather rapidly from the true
observations. This is a common phenomenon. For instance, weather forecasts for the next 24
hours tend to be pretty accurate but beyond that the accuracy declines rapidly. We will discuss
methods for improving this throughout this chapter and beyond.
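

The recursion $\epsilon_k = \bar\epsilon + c \epsilon_{k-1}$ is easy to iterate by hand. The sketch below is a toy
calculation, not a measurement of the trained network: the values of `eps_bar` and `c` are arbitrary
assumptions and `accumulated_errors` is a hypothetical helper.

```python
def accumulated_errors(eps_bar, c, steps):
    """Iterate eps_k = eps_bar + c * eps_{k-1}, starting from eps_1 = eps_bar."""
    errors = [eps_bar]
    for _ in range(steps - 1):
        errors.append(eps_bar + c * errors[-1])
    return errors


# Compare a contracting (c < 1), neutral (c = 1), and expanding (c > 1) regime
for c in (0.5, 1.0, 1.5):
    errs = accumulated_errors(eps_bar=0.1, c=c, steps=10)
    print(f'c = {c}: error after 10 steps = {errs[-1]:.3f}')
```

For $c < 1$ the error stays bounded by $\bar\epsilon / (1 - c)$, for $c = 1$ it grows linearly with the
number of steps, and for $c > 1$ it grows geometrically.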


Let us take a closer look at the difficulties in k-step-ahead predictions by computing predictions 
on the entire sequence for k = 1,4, 16, 64. 


max_steps = 64


features = np.zeros((T - tau - max_steps + 1, tau + max_steps))
# Column `i` (`i` < `tau`) are observations from `x` for time steps from
# `i + 1` to `i + T - tau - max_steps + 1`
for i in range(tau):
    features[:, i] = x[i:i + T - tau - max_steps + 1]


# Column `i` (`i` >= `tau`) are the (`i - tau + 1`)-step-ahead predictions for
# time steps from `i + 1` to `i + T - tau - max_steps + 1`
for i in range(tau, tau + max_steps):
    features[:, i] = net(features[:, i - tau:i]).reshape(-1)


steps = (1, 4, 16, 64)
d2l.plot([time[tau + i - 1:T - max_steps + i] for i in steps],
         [features[:, (tau + i - 1)].asnumpy() for i in steps],
         'time',
         'x',
         legend=[f'{i}-step preds' for i in steps],
         xlim=[5, 1000],
         figsize=(6, 3))


[Figure: 1-step, 4-step, 16-step, and 64-step predictions plotted against time.]


This clearly illustrates how the quality of the prediction changes as we try to predict further into
the future. While the 4-step-ahead predictions still look good, anything beyond that is almost
useless.


Summary 


• There is quite a difference in difficulty between interpolation and extrapolation. Consequently,
  if you have a sequence, always respect the temporal order of the data when training,
  i.e., never train on future data.


• Sequence models require specialized statistical tools for estimation. Two popular choices
  are autoregressive models and latent-variable autoregressive models.


• For causal models (e.g., time going forward), estimating the forward direction is typically a
  lot easier than the reverse direction.


• For an observed sequence up to time step $t$, its predicted output at time step $t + k$ is the
  $k$-step-ahead prediction. As we predict further in time by increasing $k$, the errors accumulate
  and the quality of the prediction degrades, often dramatically.


Exercises 


1. Improve the model in the experiment of this section. 
1. Incorporate more than the past 4 observations? How many do you really need? 


2. How many past observations would you need if there was no noise? Hint: you can write
sin and cos as a differential equation.


3. Can you incorporate older observations while keeping the total number of features constant?
Does this improve accuracy? Why?


4. Change the neural network architecture and evaluate the performance. 


2. An investor wants to find a good security to buy. He looks at past returns to decide which 
one is likely to do well. What could possibly go wrong with this strategy? 


3. Does causality also apply to text? To which extent? 


4. Give an example for when a latent autoregressive model might be needed to capture the 
dynamic of the data. 




8.2 Text Preprocessing 


We have reviewed and evaluated statistical tools and prediction challenges for sequence data.
Such data can take many forms. In particular, text, which many chapters of this book focus on,
is one of the most popular examples of sequence data. For example, an article can be simply
viewed as a sequence of words, or even a sequence of characters. To facilitate our future
experiments with sequence data, we dedicate this section to explaining common preprocessing steps
for text. Usually, these steps are:


1. Load text as strings into memory.


2. Split strings into tokens (e.g., words and characters).


3. Build a table of vocabulary to map the split tokens to numerical indices.


4. Convert text into sequences of numerical indices so they can be manipulated by models
   easily.


import collections
from d2l import mxnet as d2l
import re


8.2.1 Reading the Dataset


To get started we load text from H. G. Wells' The Time Machine. This is a fairly small corpus of
just over 30000 words, but for the purpose of what we want to illustrate this is just fine. More
realistic document collections contain many billions of words. The following function reads the dataset
into a list of text lines, where each line is a string. For simplicity, here we ignore punctuation and
capitalization.


#@save
d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt',
                                '090b5e7e70c295757f55df93cb0a180b9691891a')


def read_time_machine():  #@save
    """Load the time machine dataset into a list of text lines."""
    with open(d2l.download('time_machine'), 'r') as f:
        lines = f.readlines()
    # Replace every run of non-letters with a space, then lowercase
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]


lines = read_time_machine()
print(f'# text lines: {len(lines)}')
print(lines[0])
print(lines[10])


# text lines: 3221
the time machine by h g wells
twinkled and his usually pale face was flushed and animated the


8.2.2 Tokenization 


The following tokenize function takes a list (lines) as the input, where each element is a text sequence
(e.g., a text line). Each text sequence is split into a list of tokens. A token is the basic unit in text.
In the end, a list of token lists is returned, where each token is a string.


def tokenize(lines, token='word'):  #@save
    """Split text lines into word or character tokens."""
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        print('ERROR: unknown token type: ' + token)


tokens = tokenize(lines)
for i in range(11):
    print(tokens[i])


['the', 'time', 'machine', 'by', 'h', 'g', 'wells']
[]
[]
[]
[]
[]
[]
[]
['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him']
['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']
['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']


8.2.3 Vocabulary 


Tokens of string type are inconvenient for models, which take numerical inputs.
Now let us build a dictionary, often called a vocabulary as well, to map string tokens into numerical
indices starting from 0. To do so, we first count the unique tokens in all the documents from the
training set, namely a corpus, and then assign a numerical index to each unique token according to
its frequency. Rarely occurring tokens are often removed to reduce the complexity. Any token that
does not exist in the corpus or has been removed is mapped into a special unknown token “<unk>”.
We optionally add a list of reserved tokens, such as “<pad>” for padding, “<bos>” to mark the
beginning of a sequence, and “<eos>” for the end of a sequence.


class Vocab:  #@save
    """Vocabulary for text."""
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []
        # Sort according to frequencies
        counter = count_corpus(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        # The index for the unknown token is 0
        self.unk, uniq_tokens = 0, ['<unk>'] + reserved_tokens
        uniq_tokens += [token for token, freq in self.token_freqs
                        if freq >= min_freq and token not in uniq_tokens]
        self.idx_to_token, self.token_to_idx = [], dict()
        for token in uniq_tokens:
            self.idx_to_token.append(token)
            self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # A single token maps to its index, or to `self.unk` if it is unknown
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]


def count_corpus(tokens):  #@save
    """Count token frequencies."""
    # Here `tokens` is a 1D list or 2D list
    if len(tokens) == 0 or isinstance(tokens[0], list):
        # Flatten a list of token lists into a list of tokens
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)


We construct a vocabulary using the time machine dataset as the corpus. Then we print the first 
few frequent tokens with their indices. 


vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[:10])


[('<unk>', 0), ('the', 1), ('i', 2), ('and', 3), ('of', 4), ('a', 5), ('to', 6), ('was', 7),
('in', 8), ('that', 9)]


Now we can convert each text line into a list of numerical indices. 


for i in [0, 10]:
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])


words: ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']
indices: [1, 19, 50, 40, 2183, 2184, 400]
words: ['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']
indices: [2186, 3, 25, 1044, 362, 113, 7, 1421, 3, 1045, 1]


8.2.4 Putting All Things Together 


Using the above functions, we package everything into the load_corpus_time_machine function,
which returns corpus, a list of token indices, and vocab, the vocabulary of the time machine corpus.
The modifications we made here are: i) we tokenize text into characters, not words, to simplify
the training in later sections; ii) corpus is a single list, not a list of token lists, since each text line
in the time machine dataset is not necessarily a sentence or a paragraph.


def load_corpus_time_machine(max_tokens=-1):  #@save
    """Return token indices and the vocabulary of the time machine dataset."""
    lines = read_time_machine()
    tokens = tokenize(lines, 'char')
    vocab = Vocab(tokens)
    # Since each text line in the time machine dataset is not necessarily a
    # sentence or a paragraph, flatten all the text lines into a single list
    corpus = [vocab[token] for line in tokens for token in line]
    if max_tokens > 0:
        corpus = corpus[:max_tokens]
    return corpus, vocab


corpus, vocab = load_corpus_time_machine()
len(corpus), len(vocab)


(170580, 28) 


Summary 


• Text is an important form of sequence data.


• To preprocess text, we usually split text into tokens, build a vocabulary to map token strings
  into numerical indices, and convert text data into token indices for models to manipulate.


Exercises 


1. Tokenization is a key preprocessing step. It varies for different languages. Try to find another 
three commonly used methods to tokenize text. 


2. In the experiment of this section, tokenize text into words and vary the min_freq argument
of the Vocab instance. How does this affect the vocabulary size?




8.3 Language Models and the Dataset 


In Section 8.2, we saw how to map text data into tokens, where these tokens can be viewed as a
sequence of discrete observations, such as words or characters. Assume that the tokens in a text
sequence of length $T$ are in turn $x_1, x_2, \ldots, x_T$. Then, in the text sequence, $x_t$ ($1 \leq t \leq T$) can
be considered as the observation or label at time step $t$. Given such a text sequence, the goal of a
language model is to estimate the joint probability of the sequence


$$P(x_1, x_2, \ldots, x_T). \tag{8.3.1}$$


Language models are incredibly useful. For instance, an ideal language model would be able
to generate natural text just on its own, simply by drawing one token at a time
$x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1)$. Quite unlike the monkey using a typewriter, all text emerging from such a model
would pass as natural language, e.g., English text. Furthermore, it would be sufficient for
generating a meaningful dialog, simply by conditioning the text on previous dialog fragments. Clearly
we are still very far from designing such a system, since it would need to understand the text rather
than just generate grammatically sensible content.


Nonetheless, language models are of great service even in their limited form. For instance, the
phrases “to recognize speech” and “to wreck a nice beach” sound very similar. This can cause
ambiguity in speech recognition, which is easily resolved through a language model that rejects the
second transcription as outlandish. Likewise, in a document summarization algorithm it is
worthwhile knowing that “dog bites man” is much more frequent than “man bites dog”, or that “I want
to eat grandma” is a rather disturbing statement, whereas “I want to eat, grandma” is much more
benign.


8.3.1 Learning a Language Model 


The obvious question is how we should model a document, or even a sequence of tokens. Suppose
that we tokenize text data at the word level. We can take recourse to the analysis we applied to
sequence models in Section 8.1. Let us start by applying basic probability rules:


$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_1, \ldots, x_{t-1}). \tag{8.3.2}$$


For example, the probability of a text sequence containing four words would be given as:


$$P(\text{deep}, \text{learning}, \text{is}, \text{fun}) = P(\text{deep}) P(\text{learning} \mid \text{deep}) P(\text{is} \mid \text{deep}, \text{learning}) P(\text{fun} \mid \text{deep}, \text{learning}, \text{is}). \tag{8.3.3}$$


In order to compute the language model, we need to calculate the probability of words and the 
conditional probability of a word given the previous few words. Such probabilities are essentially 
language model parameters. 


Here, we assume that the training dataset is a large text corpus, such as all Wikipedia entries,
Project Gutenberg, and all text posted on the Web. The probability of words can be calculated
from the relative word frequency of a given word in the training dataset. For example, the estimate
$\hat{P}(\text{deep})$ can be calculated as the probability of any sentence starting with the word “deep”. A
slightly less accurate approach would be to count all occurrences of the word “deep” and divide the
count by the total number of words in the corpus. This works fairly well, particularly for frequent words.
Moving on, we could attempt to estimate


$$\hat{P}(\text{learning} \mid \text{deep}) = \frac{n(\text{deep, learning})}{n(\text{deep})}, \tag{8.3.4}$$


where $n(x)$ and $n(x, x')$ are the number of occurrences of singletons and consecutive word pairs,
respectively. Unfortunately, estimating the probability of a word pair is somewhat more difficult,
since the occurrences of “deep learning” are a lot less frequent. In particular, for some unusual
word combinations it may be tricky to find enough occurrences to get accurate estimates. Things
take a turn for the worse for three-word combinations and beyond. There will be many plausible
three-word combinations that we likely will not see in our dataset. Unless we provide some
solution to assign such word combinations a nonzero count, we will not be able to use them in a
language model. If the dataset is small or if the words are very rare, we might not find even a
single one of them.
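

The count-based estimate (8.3.4) can be sketched in a few lines. The corpus below is a made-up toy
sentence, not the time machine dataset, and `cond_prob` is a hypothetical helper name; the point is
only to show how an unseen word pair ends up with probability zero.

```python
import collections

# Toy corpus, invented for illustration only
corpus = ('deep learning is fun and deep learning is useful '
          'but deep water is cold').split()

unigram_counts = collections.Counter(corpus)
bigram_counts = collections.Counter(zip(corpus[:-1], corpus[1:]))


def cond_prob(word, prev):
    """Estimate P(word | prev) as n(prev, word) / n(prev) from raw counts.

    Assumes `prev` occurs in the corpus; otherwise the division fails.
    """
    return bigram_counts[(prev, word)] / unigram_counts[prev]


print(cond_prob('learning', 'deep'))  # n(deep, learning) / n(deep)
print(cond_prob('fun', 'deep'))       # a pair that never occurs in the corpus
```

Any sentence containing the unseen pair would receive probability zero under this estimate, which is
the problem that smoothing is meant to address.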


A common strategy is to perform some form of Laplace smoothing. The solution is to add a small
constant to all counts. Denote by $n$ the total number of words in the training set and $m$ the number
of unique words. This solution helps with singletons, e.g., via


$$\begin{aligned}
\hat{P}(x) &= \frac{n(x) + \epsilon_1 / m}{n + \epsilon_1}, \\
\hat{P}(x' \mid x) &= \frac{n(x, x') + \epsilon_2 \hat{P}(x')}{n(x) + \epsilon_2}, \\
\hat{P}(x'' \mid x, x') &= \frac{n(x, x', x'') + \epsilon_3 \hat{P}(x'')}{n(x, x') + \epsilon_3}.
\end{aligned} \tag{8.3.5}$$


Footnote: Project Gutenberg, https://en.wikipedia.org/wiki/Project_Gutenberg


Here $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ are hyperparameters. Take $\epsilon_1$ as an example: when $\epsilon_1 = 0$, no smoothing is
applied; when $\epsilon_1$ approaches positive infinity, $\hat{P}(x)$ approaches the uniform probability $1/m$. The
above is a rather primitive variant of what other techniques can accomplish (Wood et al., 2011).
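

A minimal sketch of the first two lines of (8.3.5), again on a made-up toy corpus. The function names
`p_unigram` and `p_bigram` and the default values of `eps1` and `eps2` are our own assumptions,
chosen only for illustration.

```python
import collections

# Toy corpus, invented for illustration only
corpus = 'the cat sat on the mat and the dog sat on the cat'.split()

n = len(corpus)                       # total number of words
unigram = collections.Counter(corpus)
bigram = collections.Counter(zip(corpus[:-1], corpus[1:]))
m = len(unigram)                      # number of unique words


def p_unigram(x, eps1=1.0):
    """Smoothed P(x) = (n(x) + eps1 / m) / (n + eps1)."""
    return (unigram[x] + eps1 / m) / (n + eps1)


def p_bigram(x_next, x, eps2=1.0):
    """Smoothed P(x' | x) = (n(x, x') + eps2 * P(x')) / (n(x) + eps2)."""
    return (bigram[(x, x_next)] + eps2 * p_unigram(x_next)) / (unigram[x] + eps2)


print(p_bigram('cat', 'the'))  # a pair that occurs in the corpus
print(p_bigram('mat', 'dog'))  # an unseen pair now gets a nonzero probability
print(p_unigram('the', eps1=0.0), unigram['the'] / n)  # eps1 = 0: no smoothing
```

With a very large `eps1`, `p_unigram` moves toward `1 / m` for every word, matching the limiting
behavior described above.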


Unfortunately, models like this get unwieldy rather quickly for the following reasons. First, we
need to store all counts. Second, this entirely ignores the meaning of the words. For instance,
“cat” and “feline” should occur in related contexts. It is quite difficult to adjust such models to
additional contexts, whereas deep learning based language models are well suited to take this
into account. Last, long word sequences are almost certain to be novel, hence a model that simply
counts the frequency of previously seen word sequences is bound to perform poorly there.


8.3.2 Markov Models and n-grams 


Before we discuss solutions involving deep learning, we need some more terminology and concepts.
Recall our discussion of Markov Models in Section 8.1. Let us apply this to language
modeling. A distribution over sequences satisfies the Markov property of first order if
$P(x_{t+1} \mid x_t, \ldots, x_1) = P(x_{t+1} \mid x_t)$. Higher orders correspond to longer dependencies. This leads to a
number of approximations that we could apply to model a sequence:


$$\begin{aligned}
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2) P(x_3) P(x_4), \\
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_2) P(x_4 \mid x_3), \\
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1, x_2) P(x_4 \mid x_2, x_3).
\end{aligned} \tag{8.3.6}$$


The probability formulae that involve one, two, and three variables are typically referred to as
unigram, bigram, and trigram models, respectively. In the following, we will learn how to design
better models.
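

To estimate such models from counts, one first needs the unigrams, bigrams, or trigrams that occur in
a token sequence. A small sketch follows; the helper name `ngrams` and the example sentence are
hypothetical and not part of the `d2l` package.

```python
def ngrams(tokens, n):
    """Return all tuples of `n` consecutive tokens, in order of appearance."""
    # zip stops at the shortest shifted copy, giving len(tokens) - n + 1 tuples
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = 'the time traveller smiled'.split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

Counting these tuples, for instance with `collections.Counter`, gives the statistics $n(x)$, $n(x, x')$,
and $n(x, x', x'')$ that the formulae above rely on.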


8.3.3 Natural Language Statistics 


Let us see how this works on real data. We construct a vocabulary based on the time machine 
dataset as introduced in Section 8.2 and print the top 10 most frequent words. 


from d2l import mxnet as d2l
from mxnet import np, npx
import random

npx.set_np()


tokens = d2l.tokenize(d2l.read_time_machine())
# Since each text line is not necessarily a sentence or a paragraph, we
# concatenate all text lines
corpus = [token for line in tokens for token in line]
vocab = d2l.Vocab(corpus)
vocab.token_freqs[:10]


[('the', 2261),
 ('i', 1267),
 ('and', 1245),
 ('of', 1155),
 ('a', 816),
 ('to', 695),
 ('was', 552),
 ('in', 541),
 ('that', 443),
 ('my', 440)]


As we can see, the most popular words are actually quite boring to look at. They are often referred
to as stop words and thus filtered out. Nonetheless, they still carry meaning and we will still use
them. Besides, it is quite clear that the word frequency decays rather rapidly. The $10^\mathrm{th}$ most
frequent word is less than 1/5 as common as the most popular one. To get a better idea, we plot the
word frequencies.


freqs = [freq for token, freq in vocab.token_freqs]
d2l.plot(freqs, xlabel='token: x', ylabel='frequency: n(x)',
         xscale='log', yscale='log')


[Figure: log-log plot with x-axis "token: x" and y-axis "frequency: n(x)".]


We are on to something quite fundamental here: the word frequency decays rapidly in a well- 
defined way. After dealing with the first few words as exceptions, all the remaining words roughly 
follow a straight line on a log-log plot. This means that words satisfy Zipf’s law, which states that
the frequency $n_i$ of the $i^\mathrm{th}$ most frequent word is:

$$n_i \propto \frac{1}{i^\alpha}, \qquad (8.3.7)$$

which is equivalent to

$$\log n_i = -\alpha \log i + c, \qquad (8.3.8)$$

where $\alpha$ is the exponent that characterizes the distribution and $c$ is a constant. This should already
give us pause if we want to model words by count statistics and smoothing. After all, we will 
significantly overestimate the frequency of the tail, also known as the infrequent words. But what 
about the other word combinations, such as bigrams, trigrams, and beyond? Let us see whether 
the bigram frequency behaves in the same manner as the unigram frequency. 


bigram_tokens = [pair for pair in zip(corpus[:-1], corpus[1:])] 
bigram_vocab = d2l.Vocab(bigram_tokens) 
bigram_vocab.token_freqs[:10] 

[Output: list of the ten most frequent bigrams and their counts.]


One thing is notable here. Out of the ten most frequent word pairs, nine are composed of both stop
words and only one is relevant to the actual book: “the time”. Furthermore, let us see whether the
trigram frequency behaves in the same manner.


trigram_tokens = [triple for triple in zip(
    corpus[:-2], corpus[1:-1], corpus[2:])]
trigram_vocab = d2l.Vocab(trigram_tokens)
trigram_vocab.token_freqs[:10]


[(('the', 'time', 'traveller'), 59),
 (('the', 'time', 'machine'), 30),
 (('the', 'medical', 'man'), 24),
 (('it', 'seemed', 'to'), 16),
 ...,
 (('here', 'and', 'there'), 15),
 (('seemed', 'to', 'me'), 14),
 ...]


Last, let us visualize the token frequency among these three models: unigrams, bigrams, and
trigrams.


bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]
trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]
d2l.plot([freqs, bigram_freqs, trigram_freqs], xlabel='token: x',
         ylabel='frequency: n(x)', xscale='log', yscale='log',
         legend=['unigram', 'bigram', 'trigram'])

This figure is quite exciting for a number of reasons. First, beyond unigram words, sequences of
words also appear to be following Zipf’s law, albeit with a smaller exponent $\alpha$ in (8.3.7), depending
on the sequence length. Second, the number of distinct n-grams is not that large. This gives us
hope that there is quite a lot of structure in language. Third, many n-grams occur very rarely, 
which makes Laplace smoothing rather unsuitable for language modeling. Instead, we will use 
deep learning based models. 
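
One way to put a number on the exponent in (8.3.8) is an ordinary least-squares fit of $\log n_i$ against $\log i$. The sketch below is one possible approach rather than the procedure used for the figures above. The function name `estimate_zipf_exponent` and its `skip` argument (for ignoring the first few ranks, which the text treats as exceptions) are assumptions of this example, and the fit is checked on synthetic frequencies rather than on the book's data.

```python
import math

def estimate_zipf_exponent(freqs, skip=0):
    """Least-squares fit of log n_i = -alpha * log i + c; returns (alpha, c).

    `freqs` must be sorted in decreasing order; ranks start at 1 and the
    first `skip` ranks are ignored.
    """
    points = [(math.log(i), math.log(n))
              for i, n in enumerate(freqs, start=1) if i > skip and n > 0]
    xs, ys = zip(*points)
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    slope = sxy / sxx
    return -slope, y_mean - slope * x_mean

# Synthetic frequencies that follow n_i = 1000 / i^1.2 exactly
synthetic = [1000.0 / i ** 1.2 for i in range(1, 501)]
alpha, c = estimate_zipf_exponent(synthetic)
print(alpha, c)
```

Passing the `freqs`, `bigram_freqs`, or `trigram_freqs` lists from above would give a rough estimate for each n-gram order, although a plain least-squares fit weights the many rare tokens heavily.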


8.3.4 Reading Long Sequence Data 


Since sequence data are by their very nature sequential, we need to address the issue of process- 
ing it. We did so in a rather ad-hoc manner in Section 8.1. When sequences get too long to be 
processed by models all at once, we may wish to split such sequences for reading. Now let us de- 
scribe general strategies. Before introducing the model, let us assume that we will use a neural 
network to train a language model, where the network processes a minibatch of sequences with 
predefined length, say n time steps, at a time. Now the question is how to read minibatches of 
features and labels at random. 


To begin with, since a text sequence can be arbitrarily long, such as the entire The Time Machine 
book, we can partition such a long sequence into subsequences with the same number of time 
steps. When training our neural network, a minibatch of such subsequences will be fed into the 
model. Suppose that the network processes a subsequence of n time steps at a time. Fig. 8.3.1
shows all the different ways to obtain subsequences from an original text sequence, where n = 5 
and a token at each time step corresponds to a character. Note that we have quite some freedom 
since we could pick an arbitrary offset that indicates the initial position. 

[Figure: the text "the time machine by h g wells" split into subsequences of n = 5 characters, starting from different offsets.]


Fig. 8.3.1: Different offsets lead to different subsequences when splitting up text. 


Hence, which one should we pick from Fig. 8.3.1? In fact, all of them are equally good. However, if 
we pick just one offset, there is limited coverage of all the possible subsequences for training our 
network. Therefore, we can start with a random offset to partition a sequence to get both coverage 
and randomness. In the following, we describe how to accomplish this for both random sampling 
and sequential partitioning strategies. 
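
The effect of the offset can be seen on the character sequence from Fig. 8.3.1. The following is a minimal sketch in plain Python. The helper `partition` is hypothetical and not part of the d2l package.

```python
def partition(sequence, num_steps, offset):
    """Split `sequence` into consecutive chunks of length `num_steps`,
    starting at position `offset`; leftover tokens at the end are dropped."""
    trimmed = sequence[offset:]
    num_subseqs = len(trimmed) // num_steps
    return [trimmed[i * num_steps:(i + 1) * num_steps]
            for i in range(num_subseqs)]

text = 'the time machine by h g wells'
for offset in range(5):
    print(offset, partition(text, 5, offset))
```

Each offset yields a different set of subsequences, so drawing the offset at random in every epoch exposes the model to more of the possible subsequences than a fixed offset would.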


Random Sampling 


In random sampling, each example is a subsequence arbitrarily captured on the original long 
sequence. The subsequences from two adjacent random minibatches during iteration are not 
necessarily adjacent on the original sequence. For language modeling, the target is to predict the 
next token based on what tokens we have seen so far, hence the labels are the original sequence, 
shifted by one token. 
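
The label construction can be sketched in a few lines of plain Python. The helper `features_and_labels` and the toy character sequence are assumptions made for illustration. The subtraction of one in the pair count reflects that the last token of the corpus can only serve as a label.

```python
def features_and_labels(corpus, pos, num_steps):
    """Return a feature subsequence and its labels, shifted by one token."""
    X = corpus[pos:pos + num_steps]
    Y = corpus[pos + 1:pos + 1 + num_steps]
    return X, Y

toy_tokens = list('time machine')
X, Y = features_and_labels(toy_tokens, 0, 5)
print(X)
print(Y)

# Number of complete feature-label pairs of length 5 in this toy sequence
num_pairs = (len(toy_tokens) - 1) // 5
print(num_pairs)
```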


The following code randomly generates a minibatch from the data each time. Here, the argument 
batch_size specifies the number of subsequence examples in each minibatch and num_steps is 
the predefined number of time steps in each subsequence. 


def seq_data_iter_random(corpus, batch_size, num_steps):  #@save
    """Generate a minibatch of subsequences using random sampling."""
    # Start with a random offset (inclusive of `num_steps - 1`) to partition a
    # sequence
    corpus = corpus[random.randint(0, num_steps - 1):]
    # Subtract 1 since we need to account for labels
    num_subseqs = (len(corpus) - 1) // num_steps
    # The starting indices for subsequences of length `num_steps`
    initial_indices = list(range(0, num_subseqs * num_steps, num_steps))
    # In random sampling, the subsequences from two adjacent random
    # minibatches during iteration are not necessarily adjacent on the
    # original sequence
    random.shuffle(initial_indices)

    def data(pos):
        # Return a sequence of length `num_steps` starting from `pos`
        return corpus[pos: pos + num_steps]

    num_batches = num_subseqs // batch_size
    for i in range(0, batch_size * num_batches, batch_size):
        # Here, `initial_indices` contains randomized starting indices for
        # subsequences
        initial_indices_per_batch = initial_indices[i: i + batch_size]
        X = [data(j) for j in initial_indices_per_batch]
        Y = [data(j + 1) for j in initial_indices_per_batch]
        yield np.array(X), np.array(Y)


Let us manually generate a sequence from 0 to 34. We assume that the batch size and number of
time steps are 2 and 5, respectively. This means that we can generate $\lfloor (35 - 1)/5 \rfloor = 6$
feature-label subsequence pairs. With a minibatch size of 2, we only get 3 minibatches.


my_seq = list(range(35))
for X, Y in seq_data_iter_random(my_seq, batch_size=2, num_steps=5):
    print('X: ', X, '\nY:', Y)


Sequential Partitioning


In addition to random sampling of the original sequence, we can also ensure that the subse- 
quences from two adjacent minibatches during iteration are adjacent on the original sequence. 
This strategy preserves the order of split subsequences when iterating over minibatches, hence is 
called sequential partitioning. 


def seq_data_iter_sequential(corpus, batch_size, num_steps):  #@save
    """Generate a minibatch of subsequences using sequential partitioning."""
    # Start with a random offset to partition a sequence
    offset = random.randint(0, num_steps)
    num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size
    Xs = np.array(corpus[offset: offset + num_tokens])
    Ys = np.array(corpus[offset + 1: offset + 1 + num_tokens])
    Xs, Ys = Xs.reshape(batch_size, -1), Ys.reshape(batch_size, -1)
    num_batches = Xs.shape[1] // num_steps
    for i in range(0, num_steps * num_batches, num_steps):
        X = Xs[:, i: i + num_steps]
        Y = Ys[:, i: i + num_steps]
        yield X, Y


Using the same settings, let us print features X and labels Y for each minibatch of subsequences
read by sequential partitioning. Note that the subsequences from two adjacent minibatches during
iteration are indeed adjacent on the original sequence.


for X, Y in seq_data_iter_sequential(my_seq, batch_size=2, num_steps=5):
    print('X: ', X, '\nY:', Y)


Now we wrap the above two sampling functions in a class so that we can use it as a data iterator
later.


class SeqDataLoader:  #@save
    """An iterator to load sequence data."""
    def __init__(self, batch_size, num_steps, use_random_iter, max_tokens):
        if use_random_iter:
            self.data_iter_fn = d2l.seq_data_iter_random
        else:
            self.data_iter_fn = d2l.seq_data_iter_sequential
        self.corpus, self.vocab = d2l.load_corpus_time_machine(max_tokens)
        self.batch_size, self.num_steps = batch_size, num_steps

    def __iter__(self):
        return self.data_iter_fn(self.corpus, self.batch_size, self.num_steps)


Last, we define a function load_data_time_machine that returns both the data iterator and the
vocabulary, so we can use it in the same way as other functions with the load_data prefix, such as
d2l.load_data_fashion_mnist defined in Section 3.5.


def load_data_time_machine(batch_size, num_steps,  #@save
                           use_random_iter=False, max_tokens=10000):
    """Return the iterator and the vocabulary of the time machine dataset."""
    data_iter = SeqDataLoader(
        batch_size, num_steps, use_random_iter, max_tokens)
    return data_iter, data_iter.vocab

Summary


* Language models are key to natural language processing.
* n-grams provide a convenient model for dealing with long sequences by truncating the
  dependence.
* Long sequences suffer from the problem that they occur very rarely or never.
* Zipf’s law governs the word distribution for not only unigrams but also the other n-grams.
* There is a lot of structure but not enough frequency to deal with infrequent word
  combinations efficiently via Laplace smoothing.
* The main choices for reading long sequences are random sampling and sequential
  partitioning. The latter can ensure that the subsequences from two adjacent minibatches during
  iteration are adjacent on the original sequence.


Exercises 
1. Suppose there are 100,000 words in the training dataset. How much word frequency and
   multi-word adjacent frequency does a four-gram need to store?
2. How would you model a dialogue?
3. Estimate the exponent of Zipf’s law for unigrams, bigrams, and trigrams.
4. What other methods can you think of for reading long sequence data?
5. Consider the random offset that we use for reading long sequences.
    1. Why is it a good idea to have a random offset?
    2. Does it really lead to a perfectly uniform distribution over the sequences on the
       document?
    3. What would you have to do to make things even more uniform?
6. If we want a sequence example to be a complete sentence, what kind of problem does this
   introduce in minibatch sampling? How can we fix the problem?


Discussions


8.4 Recurrent Neural Networks 


In Section 8.3 we introduced n-gram models, where the conditional probability of word $x_t$ at time
step $t$ only depends on the $n - 1$ previous words. If we want to incorporate the possible effect of
words earlier than time step $t - (n - 1)$ on $x_t$, we need to increase $n$. However, the number of
model parameters would also increase exponentially with it, as we need to store $|\mathcal{V}|^n$ numbers for
a vocabulary set $\mathcal{V}$. Hence, rather than modeling $P(x_t \mid x_{t-1}, \ldots, x_{t-n+1})$ it is preferable to use a
latent variable model:

$$P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1}), \qquad (8.4.1)$$

where $h_{t-1}$ is a hidden state (also known as a hidden variable) that stores the sequence information
up to time step $t - 1$. In general, the hidden state at any time step $t$ could be computed based on
both the current input $x_t$ and the previous hidden state $h_{t-1}$:

$$h_t = f(x_t, h_{t-1}). \qquad (8.4.2)$$

For a sufficiently powerful function $f$ in (8.4.2), the latent variable model is not an approximation.
After all, $h_t$ may simply store all the data it has observed so far. However, it could potentially make
both computation and storage expensive.
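
To make the exponential growth of a full n-gram table concrete, the short sketch below evaluates $|\mathcal{V}|^n$ for a small vocabulary. The vocabulary size of 28 and the helper name `ngram_table_size` are illustrative assumptions.

```python
def ngram_table_size(vocab_size, n):
    """Number of entries |V|^n in a full n-gram count table."""
    return vocab_size ** n

vocab_size = 28  # illustrative character-level vocabulary size
for n in range(1, 6):
    print(n, ngram_table_size(vocab_size, n))
```

Even for such a small vocabulary the table grows by a factor of 28 with each additional token of context, whereas a hidden state of fixed size does not grow with the length of the history.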


Recall that we have discussed hidden layers with hidden units in Chapter 4. It is noteworthy that 
hidden layers and hidden states refer to two very different concepts. Hidden layers are, as ex- 
plained, layers that are hidden from view on the path from input to output. Hidden states are 
technically speaking inputs to whatever we do at a given step, and they can only be computed by 
looking at data at previous time steps. 


Recurrent neural networks (RNNs) are neural networks with hidden states. Before introducing the 
RNN model, we first revisit the MLP model introduced in Section 4.1. 


8.4.1 Neural Networks without Hidden States 


Let us take a look at an MLP with a single hidden layer. Let the hidden layer's activation function
be $\phi$. Given a minibatch of examples $\mathbf{X} \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs, the hidden layer's
output $\mathbf{H} \in \mathbb{R}^{n \times h}$ is calculated as

$$\mathbf{H} = \phi(\mathbf{X} \mathbf{W}_{xh} + \mathbf{b}_h). \qquad (8.4.3)$$

In (8.4.3), we have the weight parameter $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$, the bias parameter $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$, and the
number of hidden units $h$, for the hidden layer. Thus, broadcasting (see Section 2.1.3) is applied
during the summation. Next, the hidden variable $\mathbf{H}$ is used as the input of the output layer. The
output layer is given by

$$\mathbf{O} = \mathbf{H} \mathbf{W}_{hq} + \mathbf{b}_q, \qquad (8.4.4)$$

where $\mathbf{O} \in \mathbb{R}^{n \times q}$ is the output variable, $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ is the weight parameter, and $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ is
the bias parameter of the output layer. If it is a classification problem, we can use softmax($\mathbf{O}$) to
compute the probability distribution of the output categories.


This is entirely analogous to the regression problem we solved previously in Section 8.1, hence 
we omit details. Suffice it to say that we can pick feature-label pairs at random and learn the 
parameters of our network via automatic differentiation and stochastic gradient descent. 


8.4.2 Recurrent Neural Networks with Hidden States 


Matters are entirely different when we have hidden states. Let us look at the structure in some 
more detail. 


Assume that we have a minibatch of inputs $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ at time step $t$. In other words, for a
minibatch of $n$ sequence examples, each row of $\mathbf{X}_t$ corresponds to one example at time step $t$ from
the sequence. Next, denote by $\mathbf{H}_t \in \mathbb{R}^{n \times h}$ the hidden variable of time step $t$. Unlike the MLP,
here we save the hidden variable $\mathbf{H}_{t-1}$ from the previous time step and introduce a new weight
parameter $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ to describe how to use the hidden variable of the previous time step in the
current time step. Specifically, the calculation of the hidden variable of the current time step is
determined by the input of the current time step together with the hidden variable of the previous
time step:

$$\mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h). \qquad (8.4.5)$$


Compared with (8.4.3), (8.4.5) adds one more term $\mathbf{H}_{t-1} \mathbf{W}_{hh}$ and thus instantiates (8.4.2). From
the relationship between hidden variables $\mathbf{H}_t$ and $\mathbf{H}_{t-1}$ of adjacent time steps, we know that these
variables captured and retained the sequence’s historical information up to their current time
step, just like the state or memory of the neural network’s current time step. Therefore, such a
hidden variable is called a hidden state. Since the hidden state uses the same definition of the 
previous time step in the current time step, the computation of (8.4.5) is recurrent. Hence, neural 
networks with hidden states based on recurrent computation are named recurrent neural networks. 
Layers that perform the computation of (8.4.5) in RNNs are called recurrent layers. 


There are many different ways for constructing RNNs. RNNs with a hidden state defined by (8.4.5) 
are very common. For time step t, the output of the output layer is similar to the computation in 
the MLP: 


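$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q. \qquad (8.4.6)$$


Parameters of the RNN include the weights $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$, $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$, and the bias $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$
of the hidden layer, together with the weights $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ of the output
layer. It is worth mentioning that even at different time steps, RNNs always use these model
parameters. Therefore, the parameterization cost of an RNN does not grow as the number of time
steps increases.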
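
The recurrence in (8.4.5) and the fact that the same parameters are reused at every time step can be sketched without any framework. The code below is a minimal illustration in plain Python, not the implementation used later in the chapter. The helpers `matmul`, `rnn_step`, `num_params`, and `run`, the choice of tanh for the activation, and the toy sizes and weights are all assumptions of this example.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def rnn_step(X_t, H_prev, W_xh, W_hh, b_h):
    """One application of (8.4.5) with tanh as the activation function."""
    XW = matmul(X_t, W_xh)
    HW = matmul(H_prev, W_hh)
    return [[math.tanh(xw + hw + b) for xw, hw, b in zip(xw_row, hw_row, b_h)]
            for xw_row, hw_row in zip(XW, HW)]

def num_params(d, h, q):
    """Number of scalars in W_xh, W_hh, b_h, W_hq, and b_q."""
    return d * h + h * h + h + h * q + q

# Hypothetical toy sizes: n = 1 example, d = 3 inputs, h = 2 hidden units
W_xh = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
b_h = [0.0, 0.0]

def run(sequence):
    """Apply the same parameters at every time step of `sequence`."""
    H = [[0.0, 0.0]]
    for X_t in sequence:
        H = rnn_step(X_t, H, W_xh, W_hh, b_h)
    return H

short = run([[[1.0, 2.0, 3.0]]])        # 1 time step
longer = run([[[1.0, 2.0, 3.0]]] * 10)  # 10 time steps, same parameters
print(short, longer, num_params(3, 2, 4))
```

Whether the sequence has one step or ten, the hidden state keeps the shape (n, h) and `num_params` depends only on d, h, and q.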


Fig. 8.4.1 illustrates the computational logic of an RNN at three adjacent time steps. At any time
step $t$, the computation of the hidden state can be treated as: i) concatenating the input $\mathbf{X}_t$ at
the current time step $t$ and the hidden state $\mathbf{H}_{t-1}$ at the previous time step $t - 1$; ii) feeding the
concatenation result into a fully-connected layer with the activation function $\phi$. The output of
such a fully-connected layer is the hidden state $\mathbf{H}_t$ of the current time step $t$. In this case, the
model parameters are the concatenation of $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$, and a bias of $\mathbf{b}_h$, all from (8.4.5). The
hidden state of the current time step $t$, $\mathbf{H}_t$, will participate in computing the hidden state $\mathbf{H}_{t+1}$ of
the next time step $t + 1$. What is more, $\mathbf{H}_t$ will also be fed into the fully-connected output layer to
compute the output $\mathbf{O}_t$ of the current time step $t$.

Fig. 8.4.1: An RNN with a hidden state.


We just mentioned that the calculation of $\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh}$ for the hidden state is equivalent to
matrix multiplication of concatenation of $\mathbf{X}_t$ and $\mathbf{H}_{t-1}$ and concatenation of $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$. Though
this can be proven in mathematics, in the following we just use a simple code snippet to show this.
To begin with, we define matrices X, W_xh, H, and W_hh, whose shapes are (3, 1), (1, 4), (3, 4), and 
(4, 4), respectively. Multiplying X by W_xh, and H by W_hh, respectively, and then adding these two 
multiplications, we obtain a matrix of shape (3, 4). 


from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()


X, W_xh = np.random.normal(0, 1, (3, 1)), np.random.normal(0, 1, (1, 4))
H, W_hh = np.random.normal(0, 1, (3, 4)), np.random.normal(0, 1, (4, 4))
np.dot(X, W_xh) + np.dot(H, W_hh)


array([[-0.21952868,  4.256434  ,  4.5812645 , -5.344988  ],
       [ 3.4478583 , -3.0177274 , -1.6777471 ,  7.535347  ],
       [ 2.239007  ,  1.4199957 ,  4.744728  , -8.421293  ]])


Now we concatenate the matrices X and H along columns (axis 1), and the matrices W_xh and W_hh 
along rows (axis 0). These two concatenations result in matrices of shape (3, 5) and of shape (5, 
4), respectively. Multiplying these two concatenated matrices, we obtain the same output matrix 
of shape (3, 4) as above. 


np.dot(np.concatenate((X, H), 1), np.concatenate((W_xh, W_hh), 0)) 


array([[-0.2195287, 4.256434 , 4.5812645, -5.344988 ], 
[ 3.4478583, -3.0177271, -1.677747 , 7.535347 ], 
[ 2.2390068, 1.4199957, 4.744728 , -8.421294 ]]) 


8.4.3 RNN-based Character-Level Language Models 


Recall that for language modeling in Section 8.3, we aim to predict the next token based on the 
current and past tokens, thus we shift the original sequence by one token as the labels. Bengio 
et al. first proposed to use a neural network for language modeling (Bengio et al., 2003). In the 
following we illustrate how RNNs can be used to build a language model. Let the minibatch size 
be one, and the sequence of the text be “machine”. To simplify training in subsequent sections, we 
tokenize text into characters rather than words and consider a character-level language model. Fig. 
8.4.2 demonstrates how to predict the next character based on the current and previous characters 
via an RNN for character-level language modeling. 

Fig. 8.4.2: A character-level language model based on the RNN. The input and label sequences are
“machin” and “achine”, respectively.


During the training process, we run a softmax operation on the output from the output layer for 
each time step, and then use the cross-entropy loss to compute the error between the model output 
and the label. Due to the recurrent computation of the hidden state in the hidden layer, the output 
of time step 3 in Fig. 8.4.2, $\mathbf{O}_3$, is determined by the text sequence “m”, “a”, and “c”. Since the next
character of the sequence in the training data is “h”, the loss of time step 3 will depend on the
probability distribution of the next character generated based on the feature sequence “m”, “a”,
“c” and the label “h” of this time step.
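
The loss at a single time step can be sketched as follows. The output values below are made up for illustration and do not come from a trained model, and the helper names `softmax` and `cross_entropy` as well as the small character vocabulary are assumptions of this example.

```python
import math

def softmax(logits):
    """Turn raw outputs into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label_index):
    """Negative log-probability that the model assigns to the label."""
    return -math.log(softmax(logits)[label_index])

# Hypothetical character vocabulary and hypothetical outputs at time step 3
chars = ['a', 'c', 'e', 'h', 'i', 'm', 'n']
o_3 = [0.1, 0.2, 0.1, 2.0, 0.0, -0.5, 0.3]
loss_3 = cross_entropy(o_3, chars.index('h'))
print(loss_3)
```

The loss is small when the distribution produced from “m”, “a”, “c” puts most of its mass on the label “h”, and it equals the logarithm of the vocabulary size when the distribution is uniform.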


In practice, each token is represented by a $d$-dimensional vector, and we use a batch size $n > 1$.
Therefore, the input $\mathbf{X}_t$ at time step $t$ will be an $n \times d$ matrix, which is identical to what we discussed
in Section 8.4.2.


8.4.4 Perplexity 


Last, let us discuss how to measure the language model quality, which will be used to evaluate
our RNN-based models in the subsequent sections. One way is to check how surprising the
text is. A good language model is able to predict with high accuracy the tokens that we will see
next. Consider the following continuations of the phrase “It is raining”, as proposed by different
language models:


1. “It is raining outside” 
2. “It is raining banana tree” 
3. “It is raining piouw;kcj pwepoiut” 


In terms of quality, example 1 is clearly the best. The words are sensible and logically coherent. 
While it might not quite accurately reflect which word follows semantically (“in San Francisco” and 
“in winter” would have been perfectly reasonable extensions), the model is able to capture which 
kind of word follows. Example 2 is considerably worse by producing a nonsensical extension. 
Nonetheless, at least the model has learned how to spell words and some degree of correlation 
between words. Last, example 3 indicates a poorly trained model that does not fit data properly. 


We might measure the quality of the model by computing the likelihood of the sequence. Un- 
fortunately this is a number that is hard to understand and difficult to compare. After all, shorter 
sequences are much more likely to occur than the longer ones, hence evaluating the model on Tol- 
stoy’s magnum opus War and Peace will inevitably produce a much smaller likelihood than, say, 
on Saint-Exupery’s novella The Little Prince. What is missing is the equivalent of an average. 

Information theory comes in handy here. We have defined entropy, surprisal, and cross-entropy
when we introduced the softmax regression (Section 3.4.7) and more of information theory is
discussed in the online appendix on information theory. If we want to compress text, we can ask
about predicting the next token given the current set of tokens. A better language model should
allow us to predict the next token more accurately. Thus, it should allow us to spend fewer bits in
compressing the sequence. So we can measure it by the cross-entropy loss averaged over all the n
tokens of a sequence:


$$\frac{1}{n} \sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1), \qquad (8.4.7)$$


where $P$ is given by a language model and $x_t$ is the actual token observed at time step $t$ from the
sequence. This makes the performance on documents of different lengths comparable. For
historical reasons, scientists in natural language processing prefer to use a quantity called perplexity.
In a nutshell, it is the exponential of (8.4.7):


$$\exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right). \qquad (8.4.8)$$


Perplexity can be best understood as the geometric mean of the number of real choices that we
have when deciding which token to pick next. Let us look at a number of cases:


* In the best case scenario, the model always perfectly estimates the probability of the label
  token as 1. In this case the perplexity of the model is 1.
* In the worst case scenario, the model always predicts the probability of the label token as 0.
  In this situation, the perplexity is positive infinity.
* At the baseline, the model predicts a uniform distribution over all the available tokens of the
  vocabulary. In this case, the perplexity equals the number of unique tokens of the
  vocabulary. In fact, if we were to store the sequence without any compression, this would be the
  best we could do to encode it. Hence, this provides a nontrivial upper bound that any useful
  model must beat.
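
The three cases above can be checked numerically. The sketch below assumes that we are handed the probabilities a model assigned to the tokens that actually occurred. The function name `perplexity` and the vocabulary size of 28 are illustrative assumptions.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability, as in (8.4.8).

    `token_probs[t]` is the probability the model assigned to the token
    that was actually observed at time step t.
    """
    if any(p == 0 for p in token_probs):
        return math.inf  # log(0) diverges, so the perplexity is infinite
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

print(perplexity([1.0] * 5))       # best case
print(perplexity([0.5, 0.0]))      # worst case
print(perplexity([1 / 28] * 10))   # uniform over a 28-token vocabulary
```

The first call gives 1, the second gives infinity, and the third gives the vocabulary size up to floating-point rounding.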


In the following sections, we will implement RNNs for character-level language models and use 
perplexity to evaluate such models. 


Summary 
* A neural network that uses recurrent computation for hidden states is called a recurrent
  neural network (RNN).
* The hidden state of an RNN can capture historical information of the sequence up to the
  current time step.
* The number of RNN model parameters does not grow as the number of time steps increases.
* We can create character-level language models using an RNN.
* We can use perplexity to evaluate the quality of language models.

Exercises


1. If we use an RNN to predict the next character in a text sequence, what is the required di- 
mension for any output? 


2. Why can RNNs express the conditional probability of a token at some time step based on all 
the previous tokens in the text sequence? 


3. What happens to the gradient if you backpropagate through a long sequence? 


4. What are some of the problems associated with the language model described in this sec- 
tion? 


Discussions


8.5 Implementation of Recurrent Neural Networks from Scratch 


In this section we will implement an RNN from scratch for a character-level language model, ac- 
cording to our descriptions in Section 8.4. Such a model will be trained on H. G. Wells’ The Time 
Machine. As before, we start by reading the dataset first, which is introduced in Section 8.3. 


%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import autograd, gluon, np, npx
npx.set_np()


batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)


8.5.1 One-Hot Encoding 


Recall that each token is represented as a numerical index in train_iter. Feeding these indices 
directly to a neural network might make it hard to learn. We often represent each token as a 
more expressive feature vector. The easiest representation is called one-hot encoding, which is 
introduced in Section 3.4.1. 


In a nutshell, we map each index to a different unit vector: assume that the number of different 
tokens in the vocabulary is N (len(vocab)) and the token indices range from 0 to N — 1. If the 
index of a token is the integer i, then we create a vector of all 0s with a length of N and set the 
element at position i to 1. This vector is the one-hot vector of the original token. The one-hot 
vectors with indices 0 and 2 are shown below. 


npx.one_hot(np.array([0, 2]), len(vocab))


a Oop Oog Dog Ooo Ooo Oss Oop Oop Oco Ge Oos Oso Oop Oop Ocg 
Ocg Oog Oso Oop Oop Oso Ooo Oon Osp Ocg Osp Oc; 
Oos Org lag Oop Oog Dop Oop Oas Oop Osp Osp Oop Oop Oco Ocg Org 
Ong Oop Ocg Oh, Oso Oog 208, Oog Oco Ors OID 





107 https://discuss.d2l.ai/t/337







The shape of the minibatch that we sample each time is (batch size, number of time steps). The
one_hot function transforms such a minibatch into a three-dimensional tensor whose last
dimension equals the vocabulary size (len(vocab)). We often transpose the input so that we will
obtain an output of shape (number of time steps, batch size, vocabulary size). This will allow
us to more conveniently loop through the outermost dimension for updating hidden states of a
minibatch, time step by time step.


X = np.arange(10).reshape((2, 5)) 
npx.one_hot(X.T, 28).shape 


(5, 2, 28) 


8.5.2 Initializing the Model Parameters 


Next, we initialize the model parameters for the RNN model. The number of hidden units 
num_hiddens is a tunable hyperparameter. When training language models, the inputs and out- 
puts are from the same vocabulary. Hence, they have the same dimension, which is equal to the 
vocabulary size. 


def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return np.random.normal(scale=0.01, size=shape, ctx=device)

    # Hidden layer parameters
    W_xh = normal((num_inputs, num_hiddens))
    W_hh = normal((num_hiddens, num_hiddens))
    b_h = np.zeros(num_hiddens, ctx=device)
    # Output layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = np.zeros(num_outputs, ctx=device)
    # Attach gradients
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.attach_grad()
    return params





8.5.3 RNN Model 


To define an RNN model, we first need an init_rnn_state function to return the hidden state at 
initialization. It returns a tensor filled with 0 and with a shape of (batch size, number of hidden 
units). Using tuples makes it easier to handle situations where the hidden state contains multiple 
variables, which we will encounter in later sections. 


def init_rnn_state(batch_size, num_hiddens, device):
    return (np.zeros((batch_size, num_hiddens), ctx=device), )


The following rnn function defines how to compute the hidden state and output at a time step.
Note that the RNN model loops through the outermost dimension of inputs so that it updates
hidden states H of a minibatch, time step by time step. Besides, the activation function here uses
the tanh function. As described in Section 4.1, the mean value of the tanh function is 0, when the
elements are uniformly distributed over the real numbers.


def rnn(inputs, state, params):
    # Shape of `inputs`: (`num_steps`, `batch_size`, `vocab_size`)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    # Shape of `X`: (`batch_size`, `vocab_size`)
    for X in inputs:
        H = np.tanh(np.dot(X, W_xh) + np.dot(H, W_hh) + b_h)
        Y = np.dot(H, W_hq) + b_q
        outputs.append(Y)
    return np.concatenate(outputs, axis=0), (H,)





With all the needed functions being defined, next we create a class to wrap these functions and 
store parameters for an RNN model implemented from scratch. 


class RNNModelScratch:  #@save
    """An RNN Model implemented from scratch."""
    def __init__(self, vocab_size, num_hiddens, device, get_params,
                 init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, device)
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state):
        X = npx.one_hot(X.T, self.vocab_size)
        return self.forward_fn(X, state, self.params)

    def begin_state(self, batch_size, ctx):
        return self.init_state(batch_size, self.num_hiddens, ctx)


Let us check whether the outputs have the correct shapes, e.g., to ensure that the dimensionality 
of the hidden state remains unchanged. 


num_hiddens = 512
net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params,
                      init_rnn_state, rnn)
state = net.begin_state(X.shape[0], d2l.try_gpu())
Y, new_state = net(X.as_in_context(d2l.try_gpu()), state)
Y.shape, len(new_state), new_state[0].shape


((10, 28), 1, (2, 512)) 


We can see that the output shape is (number of time steps × batch size, vocabulary size), while the
hidden state shape remains the same, i.e., (batch size, number of hidden units).
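
The row order of that flattened output matters once it is compared with labels. The following framework-free sketch uses plain Python lists; the toy sizes and the helper name concat_time_major are assumptions for illustration and are not part of the book's code. It suggests why the training code flattens the labels with Y.T.reshape(-1): concatenating per-step outputs along the first axis yields rows in time-major order.

```python
# Toy sizes (assumed for illustration only).
num_steps, batch_size = 3, 2

# Labels as a data iterator would yield them: shape (batch_size, num_steps).
# Entry (b, t) is encoded as 10 * b + t so that its origin is easy to read off.
labels = [[10 * b + t for t in range(num_steps)] for b in range(batch_size)]


def concat_time_major(per_step_rows):
    """Mimic concatenation along axis 0 of the per-step outputs."""
    flat = []
    for rows_at_t in per_step_rows:  # one entry per time step
        flat.extend(rows_at_t)       # each entry holds batch_size rows
    return flat


# Per-step "outputs": at step t there is one row per example b.
per_step = [[(b, t) for b in range(batch_size)] for t in range(num_steps)]
row_order = concat_time_major(per_step)

# Transposing the labels and flattening them gives the same (b, t) order.
labels_T_flat = [labels[b][t] for t in range(num_steps) for b in range(batch_size)]

print(row_order)      # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
print(labels_T_flat)  # [0, 10, 1, 11, 2, 12]
```

Row k of the concatenated output therefore belongs to example k mod batch_size at time step k // batch_size, which matches the transposed and flattened labels.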


8.5.4 Prediction


Let us first define the prediction function to generate new characters following the user-provided
prefix, which is a string containing several characters. When looping through these beginning
characters in prefix, we keep passing the hidden state to the next time step without generating
any output. This is called the warm-up period, during which the model updates its internal state
(i.e., the hidden state) but does not make predictions. After the warm-up period, the hidden state
is generally a better starting point than its initial value. We then generate the predicted
characters one at a time and emit them.


def predict_ch8(prefix, num_preds, net, vocab, device):  #@save
    """Generate new characters following the `prefix`."""
    state = net.begin_state(batch_size=1, ctx=device)
    outputs = [vocab[prefix[0]]]
    get_input = lambda: np.array([outputs[-1]], ctx=device).reshape((1, 1))
    for y in prefix[1:]:  # Warm-up period
        _, state = net(get_input(), state)
        outputs.append(vocab[y])
    for _ in range(num_preds):  # Predict `num_preds` steps
        y, state = net(get_input(), state)
        outputs.append(int(y.argmax(axis=1).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])


Now we can test the predict_ch8 function. We specify the prefix as time traveller and have it 
generate 10 additional characters. Given that we have not trained the network, it will generate 
nonsensical predictions. 


predict_ch8('time traveller ', 10, net, vocab, d2l.try_gpu())


8.5.5 Gradient Clipping 


For a sequence of length T, we compute the gradients over these T time steps in an iteration, which 
results in a chain of matrix-products with length O(T) during backpropagation. As mentioned 
in Section 4.8, it might result in numerical instability, e.g., the gradients may either explode or 
vanish, when T is large. Therefore, RNN models often need extra help to stabilize the training. 


Generally speaking, when solving an optimization problem, we take update steps for the model
parameter, say in the vector form $\mathbf{x}$, in the direction of the negative gradient
$\mathbf{g}$ on a minibatch. For example, with $\eta > 0$ as the learning rate, in one iteration we
update $\mathbf{x}$ as $\mathbf{x} - \eta \mathbf{g}$. Let us further assume that the objective
function $f$ is well behaved, say, Lipschitz continuous with constant $L$. That is to say, for any
$\mathbf{x}$ and $\mathbf{y}$ we have


$$|f(\mathbf{x}) - f(\mathbf{y})| \leq L \|\mathbf{x} - \mathbf{y}\|.$$ (8.5.1)


In this case we can safely assume that if we update the parameter vector by $\eta \mathbf{g}$, then


$$|f(\mathbf{x}) - f(\mathbf{x} - \eta \mathbf{g})| \leq L \eta \|\mathbf{g}\|,$$ (8.5.2)


which means that we will not observe a change by more than $L \eta \|\mathbf{g}\|$. This is both a
curse and a blessing. On the curse side, it limits the speed of making progress; whereas on the
blessing side, it limits the extent to which things can go wrong if we move in the wrong direction.

Sometimes the gradients can be quite large and the optimization algorithm may fail to converge.
We could address this by reducing the learning rate $\eta$. But what if we only rarely get large
gradients? In this case such an approach may appear entirely unwarranted. One popular alternative
is to clip the gradient $\mathbf{g}$ by projecting it back onto a ball of a given radius, say
$\theta$, via


$$\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}.$$ (8.5.3)


By doing so we know that the gradient norm never exceeds $\theta$ and that the updated gradient is
entirely aligned with the original direction of $\mathbf{g}$. It also has the desirable side-effect
of limiting the influence any given minibatch (and within it any given sample) can exert on the
parameter vector. This bestows a certain degree of robustness to the model. Gradient clipping
provides a quick fix for exploding gradients. While it does not entirely solve the problem, it is
one of the many techniques to alleviate it.


Below we define a function to clip the gradients of a model that is implemented from scratch or a 
model constructed by the high-level APIs. Also note that we compute the gradient norm over all 
the model parameters. 


def grad_clipping(net, theta):  #@save
    """Clip the gradient."""
    if isinstance(net, gluon.Block):
        params = [p.data() for p in net.collect_params().values()]
    else:
        params = net.params
    norm = math.sqrt(sum((p.grad ** 2).sum() for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
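
To see the effect of (8.5.3) on concrete numbers, here is a minimal framework-free sketch. The function name clip_by_global_norm and the toy gradient values are assumptions for illustration; it mirrors the idea of computing one norm over all parameters, not the exact MXNet code above.

```python
import math


def clip_by_global_norm(grads, theta):
    """Rescale a list of gradient vectors so their joint L2 norm is at most theta."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, theta / norm) if norm > 0 else 1.0
    return [[g * scale for g in vec] for vec in grads], norm


# Two hypothetical parameter gradients with joint norm sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, theta=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(norm_before)  # 13.0
print(norm_after)   # approximately 1.0
print(clipped)      # every entry is the original value scaled by 1/13
```

Because every entry is multiplied by the same factor, the direction of the joint gradient is unchanged; gradients whose norm is already below theta pass through untouched.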


8.5.6 Training 


Before training the model, let us define a function to train the model in one epoch. It differs from 
how we train the model of Section 3.6 in three places: 


1. Different sampling methods for sequential data (random sampling and sequential partition- 
ing) will result in differences in the initialization of hidden states. 


2. We clip the gradients before updating the model parameters. This ensures that the model 
does not diverge even when gradients blow up at some point during the training process. 


3. We use perplexity to evaluate the model. As discussed in Section 8.4.4, this ensures that 
sequences of different length are comparable. 


Specifically, when sequential partitioning is used, we initialize the hidden state only at the
beginning of each epoch. Since the $i^\mathrm{th}$ subsequence example in the next minibatch is
adjacent to the current $i^\mathrm{th}$ subsequence example, the hidden state at the end of the
current minibatch will be used to initialize the hidden state at the beginning of the next
minibatch. In this way, historical information of the sequence stored in the hidden state might
flow over adjacent subsequences within an epoch. However, the computation of the hidden state at
any point depends on all the previous minibatches in the same epoch, which complicates the gradient
computation. To reduce computational cost, we detach the gradient before processing any minibatch
so that the gradient computation of the hidden state is always limited to the time steps in one
minibatch.
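
The adjacency property can be checked on a toy corpus. The sketch below is a simplified stand-in for a sequential partitioning iterator: it has no random offset and yields no labels, and the function name sequential_minibatches is an assumption for illustration rather than the book's data loader.

```python
def sequential_minibatches(corpus, batch_size, num_steps):
    """Yield minibatches X of shape (batch_size, num_steps) whose rows
    continue where the same row of the previous minibatch stopped."""
    tokens_per_row = len(corpus) // batch_size
    rows = [corpus[r * tokens_per_row:(r + 1) * tokens_per_row]
            for r in range(batch_size)]
    for start in range(0, tokens_per_row - num_steps + 1, num_steps):
        yield [row[start:start + num_steps] for row in rows]


corpus = list(range(20))  # token indices 0..19 stand in for characters
batches = list(sequential_minibatches(corpus, batch_size=2, num_steps=3))
for X in batches:
    print(X)
# [[0, 1, 2], [10, 11, 12]]
# [[3, 4, 5], [13, 14, 15]]
# [[6, 7, 8], [16, 17, 18]]
```

Row i of each minibatch picks up exactly where row i of the previous one ended, so the final hidden state of one minibatch is a meaningful initial state for the next. With random sampling no such relationship holds, which is why the state is re-initialized there.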

When using random sampling, we need to re-initialize the hidden state for each iteration since
each example is sampled with a random position. As with the train_epoch_ch3 function in
Section 3.6, updater is a general function to update the model parameters. It can be either the
d2l.sgd function implemented from scratch or the built-in optimization function in a deep learning
framework.


#@save
def train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter):
    """Train a model within one epoch (defined in Chapter 8)."""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # Initialize `state` when either it is the first iteration or
            # using random sampling
            state = net.begin_state(batch_size=X.shape[0], ctx=device)
        else:
            for s in state:
                s.detach()
        y = Y.T.reshape(-1)
        X, y = X.as_in_ctx(device), y.as_in_ctx(device)
        with autograd.record():
            y_hat, state = net(X, state)
            l = loss(y_hat, y).mean()
        l.backward()
        grad_clipping(net, 1)
        updater(batch_size=1)  # Since the `mean` function has been invoked
        metric.add(l * y.size, y.size)
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()
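
The value returned on the last line is the perplexity: the exponential of the total loss divided by the total number of tokens. A minimal sketch of that bookkeeping follows; the per-minibatch numbers are made up for illustration, and the helper name perplexity is an assumption.

```python
import math


def perplexity(total_loss, num_tokens):
    """Exponential of the average per-token cross-entropy (natural log)."""
    return math.exp(total_loss / num_tokens)


# Hypothetical minibatches: (mean loss per token, number of tokens).
minibatches = [(2.0, 64), (1.5, 64), (1.0, 32)]

# Weight each mean loss by its token count, as `metric.add` does above.
total_loss = sum(mean_loss * n for mean_loss, n in minibatches)
num_tokens = sum(n for _, n in minibatches)

print(perplexity(total_loss, num_tokens))  # exp(1.6), about 4.95

# A model that always predicts uniformly over 28 tokens has loss log(28)
# per token, so its perplexity equals the vocabulary size.
print(perplexity(math.log(28) * 100, 100))  # about 28.0
```

Weighting by token count keeps the average correct even when minibatches differ in size.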


The training function supports an RNN model implemented either from scratch or using high- 
level APIs. 


def train_ch8(net, train_iter, vocab, lr, num_epochs, device,  #@save
              use_random_iter=False):
    """Train a model (defined in Chapter 8)."""
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
                            legend=['train'], xlim=[10, num_epochs])
    # Initialize
    if isinstance(net, gluon.Block):
        net.initialize(ctx=device, force_reinit=True,
                       init=init.Normal(0.01))
        trainer = gluon.Trainer(net.collect_params(),
                                'sgd', {'learning_rate': lr})
        updater = lambda batch_size: trainer.step(batch_size)
    else:
        updater = lambda batch_size: d2l.sgd(net.params, lr, batch_size)
    predict = lambda prefix: predict_ch8(prefix, 50, net, vocab, device)
    # Train and predict
    for epoch in range(num_epochs):
        ppl, speed = train_epoch_ch8(
            net, train_iter, loss, updater, device, use_random_iter)
        if (epoch + 1) % 10 == 0:
            animator.add(epoch + 1, [ppl])
    print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(device)}')
    print(predict('time traveller'))
    print(predict('traveller'))


Now we can train the RNN model. Since we only use 10000 tokens in the dataset, the model needs 
more epochs to converge better. 


num_epochs, lr = 500, 1
train_ch8(net, train_iter, vocab, lr, num_epochs, d2l.try_gpu())


perplexity 1.1, 30751.6 tokens/sec on gpu(0) 
time traveller held in his hand was a glitteringmetallic framewo 
traveller with a slight accession ofcheerfulness really thi 


[Figure: training perplexity versus epoch.]


Finally, let us check the results of using the random sampling method. 


train_ch8(net, train_iter, vocab, lr, num_epochs, d2l.try_gpu(),
          use_random_iter=True)


perplexity 1.3, 31399.1 tokens/sec on gpu(0) 
time traveller but now you begin to seethe object of my investig 
traveller came back andfilby s anecdote collapsedthe thing 


[Figure: training perplexity versus epoch.]


While implementing the above RNN model from scratch is instructive, it is not convenient. In 
the next section we will see how to improve the RNN model, such as how to make it easier to 
implement and make it run faster. 


Summary 


* We can train an RNN-based character-level language model to generate text following the
user-provided text prefix.


* A simple RNN language model consists of input encoding, RNN modeling, and output generation.


* RNN models need state initialization for training, though random sampling and sequential
partitioning initialize the state in different ways.


* When using sequential partitioning, we need to detach the gradient to reduce computational
cost.


* A warm-up period allows a model to update itself (e.g., obtain a better hidden state than its
initialized value) before making any prediction.


* Gradient clipping prevents gradient explosion, but it cannot fix vanishing gradients.


Exercises 


1. Show that one-hot encoding is equivalent to picking a different embedding for each object. 


2. Adjust the hyperparameters (e.g., number of epochs, number of hidden units, number of 
time steps in a minibatch, and learning rate) to improve the perplexity. 


* How low can you go? 


* Replace one-hot encoding with learnable embeddings. Does this lead to better perfor- 
mance? 


* How well will it work on other books by H. G. Wells, e.g., The War of the Worlds
(http://www.gutenberg.org/ebooks/36)?


3. Modify the prediction function such as to use sampling rather than picking the most likely
next character.


* What happens?


* Bias the model towards more likely outputs, e.g., by sampling from
$q(x_t \mid x_{t-1}, \ldots, x_1) \propto P(x_t \mid x_{t-1}, \ldots, x_1)^\alpha$ for $\alpha > 1$.


4. Run the code in this section without clipping the gradient. What happens? 


5. Change sequential partitioning so that it does not separate hidden states from the computa- 
tional graph. Does the running time change? How about the perplexity? 


6. Replace the activation function used in this section with ReLU and repeat the experiments 
in this section. Do we still need gradient clipping? Why? 


Discussions


8.6 Concise Implementation of Recurrent Neural Networks 


While Section 8.5 was instructive to see how RNNs are implemented, this is not convenient or fast. 
This section will show how to implement the same language model more efficiently using func- 
tions provided by high-level APIs of a deep learning framework. We begin as before by reading 
the time machine dataset. 


from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn, rnn

npx.set_np()


batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)


Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...


8.6.1 Defining the Model 


High-level APIs provide implementations of recurrent neural networks. We construct the recur- 
rent neural network layer rnn_layer with a single hidden layer and 256 hidden units. In fact, we 
have not even discussed yet what it means to have multiple layers—this will happen in Section 9.3. 
For now, suffice it to say that multiple layers simply amount to the output of one layer of RNN 
being used as the input for the next layer of RNN. 


num_hiddens = 256 
rnn_layer = rnn.RNN(num_hiddens) 
rnn_layer.initialize() 


Initializing the hidden state is straightforward. We invoke the member function begin_state. This
returns a list (state) that contains an initial hidden state for each example in the minibatch. The
shape of this hidden state is (number of hidden layers, batch size, number of hidden units). For
some models to be introduced later (e.g., long short-term memory), such a list also contains other
information.


state = rnn_layer.begin_state(batch_size=batch_size) 
len(state), state[0].shape 


(1, (1, 32, 256)) 


With a hidden state and an input, we can compute the output with the updated hidden state. It
should be emphasized that the “output” (Y) of rnn_layer does not involve computation of output
layers: it refers to the hidden states at all time steps, which can be used as the input to the
subsequent output layer.


Besides, the updated hidden state (state_new) returned by rnn_layer refers to the hidden state 
at the last time step of the minibatch. It can be used to initialize the hidden state for the next 
minibatch within an epoch in sequential partitioning. For multiple hidden layers, the hidden 
state of each layer will be stored in this variable (state_new). For some models to be introduced 
later (e.g., long short-term memory), this variable also contains other information. 


X = np.random.uniform(size=(num_steps, batch_size, len(vocab))) 
Y, state_new = rnn_layer(X, state) 
Y.shape, len(state_new), state_new[0].shape 


((35, 32, 256), 1, (1, 32, 256)) 


Similar to Section 8.5, we define an RNNModel class for a complete RNN model. Note that rnn_layer
only contains the hidden recurrent layers, so we need to create a separate output layer.


#@save
class RNNModel(nn.Block):
    """The RNN model."""
    def __init__(self, rnn_layer, vocab_size, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        self.rnn = rnn_layer
        self.vocab_size = vocab_size
        self.dense = nn.Dense(vocab_size)

    def forward(self, inputs, state):
        X = npx.one_hot(inputs.T, self.vocab_size)
        Y, state = self.rnn(X, state)
        # The fully-connected layer will first change the shape of `Y` to
        # (`num_steps` * `batch_size`, `num_hiddens`). Its output shape is
        # (`num_steps` * `batch_size`, `vocab_size`).
        output = self.dense(Y.reshape(-1, Y.shape[-1]))
        return output, state

    def begin_state(self, *args, **kwargs):
        return self.rnn.begin_state(*args, **kwargs)


8.6.2 Training and Predicting


Before training the model, let us make a prediction with a model that has random weights.


device = d2l.try_gpu()
net = RNNModel(rnn_layer, len(vocab))
net.initialize(force_reinit=True, ctx=device)
d2l.predict_ch8('time traveller', 10, net, vocab, device)


'time travellervmoopwrrrr'


As is quite obvious, this model does not work at all. Next, we call train_ch8 with the same hyper- 
parameters defined in Section 8.5 and train our model with high-level APIs. 


num_epochs, lr = 500, 1
d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, device)


perplexity 1.2, 159260.9 tokens/sec on gpu(0) 
time traveller but now you begin to seethe object of the fire wi 
travellery it but some foolishpeople have been as ingerence 


[Figure: training perplexity (legend: train) versus epoch.]


Compared with the last section, this model achieves comparable perplexity, albeit within a shorter 
period of time, due to the code being more optimized by high-level APIs of the deep learning 
framework. 


Summary 


* High-level APIs of the deep learning framework provide an implementation of the RNN
layer.


* The RNN layer of high-level APIs returns an output and an updated hidden state, where the
output does not involve output layer computation.


* Using high-level APIs leads to faster RNN training than using its implementation from
scratch.


Exercises


1. Can you make the RNN model overfit using the high-level APIs? 


2. What happens if you increase the number of hidden layers in the RNN model? Can you make 
the model work? 


3. Implement the autoregressive model of Section 8.1 using an RNN. 


Discussions


8.7 Backpropagation Through Time 


So far we have repeatedly alluded to things like exploding gradients, vanishing gradients, and the 
need to detach the gradient for RNNs. For instance, in Section 8.5 we invoked the detach function 
on the sequence. None of this was really fully explained, in the interest of being able to build a 
model quickly and to see how it works. In this section, we will delve a bit more deeply into the 
details of backpropagation for sequence models and why (and how) the mathematics works. 


We encountered some of the effects of gradient explosion when we first implemented RNNs (Sec- 
tion 8.5). In particular, if you solved the exercises, you would have seen that gradient clipping is 
vital to ensure proper convergence. To provide a better understanding of this issue, this section 
will review how gradients are computed for sequence models. Note that there is nothing con- 
ceptually new in how it works. After all, we are still merely applying the chain rule to compute 
gradients. Nonetheless, it is worthwhile to review backpropagation (Section 4.7) again.


We have described forward and backward propagations and computational graphs in MLPs in 
Section 4.7. Forward propagation in an RNN is relatively straightforward. Backpropagation through 
time is actually a specific application of backpropagation in RNNs (Werbos, 1990). It requires us 
to expand the computational graph of an RNN one time step at a time to obtain the dependencies 
among model variables and parameters. Then, based on the chain rule, we apply backpropagation 
to compute and store gradients. Since sequences can be rather long, the dependency can be rather 
lengthy. For instance, for a sequence of 1000 characters, the first token could potentially have 
significant influence on the token at the final position. This is not really computationally feasible 
(it takes too long and requires too much memory) and it requires over 1000 matrix products before 
we would arrive at that very elusive gradient. This is a process fraught with computational and 
statistical uncertainty. In the following we will elucidate what happens and how to address this in 
practice. 


8.7.1 Analysis of Gradients in RNNs 


We start with a simplified model of how an RNN works. This model ignores details about the 
specifics of the hidden state and how it is updated. The mathematical notation here does not 
explicitly distinguish scalars, vectors, and matrices as it used to do. These details are immaterial 
to the analysis and would only serve to clutter the notation in this subsection. 


In this simplified model, we denote $h_t$ as the hidden state, $x_t$ as the input, and $o_t$ as the
output at time step $t$. Recall our discussions in Section 8.4.2 that the input and the hidden state
can be concatenated to be multiplied by one weight variable in the hidden layer. Thus, we use $w_h$
and $w_o$ to indicate the weights of the hidden layer and the output layer, respectively. As a
result, the hidden states and outputs at each time step can be expressed as


$$\begin{aligned}
h_t &= f(x_t, h_{t-1}, w_h),\\
o_t &= g(h_t, w_o),
\end{aligned}$$ (8.7.1)


where $f$ and $g$ are transformations of the hidden layer and the output layer, respectively. Hence,
we have a chain of values $\{\ldots, (x_{t-1}, h_{t-1}, o_{t-1}), (x_t, h_t, o_t), \ldots\}$ that
depend on each other via recurrent computation. The forward propagation is fairly straightforward.
All we need is to loop through the $(x_t, h_t, o_t)$ triples one time step at a time. The
discrepancy between output $o_t$ and the desired label $y_t$ is then evaluated by an objective
function across all the $T$ time steps as


$$L(x_1, \ldots, x_T, y_1, \ldots, y_T, w_h, w_o) = \frac{1}{T} \sum_{t=1}^T l(y_t, o_t).$$ (8.7.2)


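For backpropagation, matters are a bit trickier, especially when we compute the gradients with
regard to the parameters $w_h$ of the objective function $L$. To be specific, by the chain rule,


$$\begin{aligned}
\frac{\partial L}{\partial w_h}
&= \frac{1}{T} \sum_{t=1}^T \frac{\partial l(y_t, o_t)}{\partial w_h}\\
&= \frac{1}{T} \sum_{t=1}^T \frac{\partial l(y_t, o_t)}{\partial o_t}
   \frac{\partial g(h_t, w_o)}{\partial h_t} \frac{\partial h_t}{\partial w_h}.
\end{aligned}$$ (8.7.3)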

The first and the second factors of the product in (8.7.3) are easy to compute. The third factor
$\partial h_t / \partial w_h$ is where things get tricky, since we need to recurrently compute the
effect of the parameter $w_h$ on $h_t$. According to the recurrent computation in (8.7.1), $h_t$
depends on both $h_{t-1}$ and $w_h$, where computation of $h_{t-1}$ also depends on $w_h$. Thus,
using the chain rule yields


$$\frac{\partial h_t}{\partial w_h} = \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h}
+ \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial w_h}.$$ (8.7.4)


To derive the above gradient, assume that we have three sequences $\{a_t\}$, $\{b_t\}$, $\{c_t\}$
satisfying $a_0 = 0$ and $a_t = b_t + c_t a_{t-1}$ for $t = 1, 2, \ldots$. Then for $t \geq 1$, it is
easy to show


$$a_t = b_t + \sum_{i=1}^{t-1} \left(\prod_{j=i+1}^{t} c_j\right) b_i.$$ (8.7.5)


By substituting $a_t$, $b_t$, and $c_t$ according to


$$\begin{aligned}
a_t &= \frac{\partial h_t}{\partial w_h},\\
b_t &= \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h},\\
c_t &= \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}},
\end{aligned}$$ (8.7.6)


the gradient computation in (8.7.4) satisfies $a_t = b_t + c_t a_{t-1}$. Thus, per (8.7.5), we can
remove the recurrent computation in (8.7.4) with


$$\frac{\partial h_t}{\partial w_h} = \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h}
+ \sum_{i=1}^{t-1} \left(\prod_{j=i+1}^{t} \frac{\partial f(x_j, h_{j-1}, w_h)}{\partial h_{j-1}}\right)
\frac{\partial f(x_i, h_{i-1}, w_h)}{\partial w_h}.$$ (8.7.7)


While we can use the chain rule to compute $\partial h_t / \partial w_h$ recursively, this chain can
get very long whenever $t$ is large. Let us discuss a number of strategies for dealing with this
problem.
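
The step from the recurrence to the closed form in (8.7.5) can be sanity-checked numerically. The sketch below uses arbitrary scalar test values (an assumption for illustration) and compares the recursive definition of $a_t$ with the right-hand side of (8.7.5).

```python
import math  # math.prod requires Python 3.8 or later


def recursive_a(b, c):
    """a_0 = 0 and a_t = b_t + c_t * a_{t-1}; index 0 of b and c is padding."""
    a = [0.0]
    for t in range(1, len(b)):
        a.append(b[t] + c[t] * a[t - 1])
    return a


def closed_form_a(b, c, t):
    """Right-hand side of (8.7.5) for a single time step t."""
    return b[t] + sum(math.prod(c[i + 1:t + 1]) * b[i] for i in range(1, t))


# Arbitrary test values; index 0 is unused padding so that b[t] means b_t.
b = [None, 1.0, -2.0, 0.5, 3.0, 1.5]
c = [None, 0.9, 1.1, -0.5, 2.0, 0.3]

a = recursive_a(b, c)
for t in range(1, len(b)):
    print(t, a[t], closed_form_a(b, c, t))
```

Both columns agree up to floating-point rounding. Note that the closed form multiplies up to $t - 1$ factors $c_j$ per term, which is the scalar analogue of the long products of derivatives in (8.7.7).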


Full Computation


Obviously, we can just compute the full sum in (8.7.7). However, this is very slow and gradients 
can blow up, since subtle changes in the initial conditions can potentially affect the outcome a 
lot. That is, we could see things similar to the butterfly effect where minimal changes in the initial 
conditions lead to disproportionate changes in the outcome. This is actually quite undesirable in 
terms of the model that we want to estimate. After all, we are looking for robust estimators that 
generalize well. Hence this strategy is almost never used in practice. 


Truncating Time Steps 


Alternatively, we can truncate the sum in (8.7.7) after $\tau$ steps. This is what we have been
discussing so far, such as when we detached the gradients in Section 8.5. This leads to an
approximation of the true gradient, simply by terminating the sum at
$\partial h_{t-\tau} / \partial w_h$. In practice this works quite well. It is what is commonly
referred to as truncated backpropagation through time (Jaeger, 2002). One of the consequences of
this is that the model focuses primarily on short-term influence rather than long-term
consequences. This is actually desirable, since it biases the estimate towards simpler and more
stable models.


Randomized Truncation 


Last, we can replace $\partial h_t / \partial w_h$ by a random variable which is correct in
expectation but truncates the sequence. This is achieved by using a sequence of $\xi_t$ with
predefined $0 \leq \pi_t \leq 1$, where $P(\xi_t = 0) = 1 - \pi_t$ and
$P(\xi_t = \pi_t^{-1}) = \pi_t$, thus $E[\xi_t] = 1$. We use this to replace the gradient
$\partial h_t / \partial w_h$ in (8.7.4) with


$$z_t = \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h}
+ \xi_t \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial w_h}.$$ (8.7.8)


It follows from the definition of $\xi_t$ that $E[z_t] = \partial h_t / \partial w_h$. Whenever
$\xi_t = 0$ the recurrent computation terminates at that time step $t$. This leads to a weighted sum
of sequences of varying lengths where long sequences are rare but appropriately overweighted. This
idea was proposed by Tallec and Ollivier (Tallec & Ollivier, 2017).
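
The claim that the estimator is correct in expectation can be checked exactly on a scalar toy problem. The sketch below makes two assumptions beyond the text: the $\xi_t$ are independent with a shared probability pi, and the substitution is applied recursively, i.e., $z_t = b_t + \xi_t c_t z_{t-1}$ with $b_t$ and $c_t$ as in (8.7.6). It enumerates every keep/stop pattern rather than sampling, so no randomness is involved.

```python
from itertools import product


def true_gradient(b, c):
    """a_t = b_t + c_t * a_{t-1} with a_0 = 0, evaluated at the last step."""
    a = 0.0
    for b_t, c_t in zip(b, c):
        a = b_t + c_t * a
    return a


def expected_randomized(b, c, pi):
    """Exact expectation of z_T, where z_t = b_t + xi_t * c_t * z_{t-1} and
    xi_t is 1/pi with probability pi and 0 with probability 1 - pi."""
    expectation = 0.0
    for keep in product([True, False], repeat=len(b)):
        prob, z = 1.0, 0.0
        for b_t, c_t, k in zip(b, c, keep):
            prob *= pi if k else 1.0 - pi
            xi = 1.0 / pi if k else 0.0
            z = b_t + xi * c_t * z
        expectation += prob * z
    return expectation


# Arbitrary scalar stand-ins for the partial derivatives b_t and c_t.
b = [1.0, -2.0, 0.5, 3.0]
c = [0.9, 1.1, -0.5, 2.0]
print(true_gradient(b, c))
print(expected_randomized(b, c, pi=0.7))
```

The two printed values agree up to rounding, whereas any single draw can differ substantially: surviving terms are scaled up by 1/pi at every kept step, which is the source of the extra variance.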


Comparing Strategies 




Fig. 8.7.1: Comparing strategies for computing gradients in RNNs. From top to bottom: random- 
ized truncation, regular truncation, and full computation. 


Fig. 8.7.1 illustrates the three strategies when analyzing the first few characters of The Time Ma- 
chine book using backpropagation through time for RNNs: 

* The first row is the randomized truncation that partitions the text into segments of varying
lengths.


* The second row is the regular truncation that breaks the text into subsequences of the same
length. This is what we have been doing in RNN experiments.


* The third row is the full backpropagation through time that leads to a computationally
infeasible expression.


Unfortunately, while appealing in theory, randomized truncation does not work much better than 
regular truncation, most likely due to a number of factors. First, the effect of an observation after 
a number of backpropagation steps into the past is quite sufficient to capture dependencies in 
practice. Second, the increased variance counteracts the fact that the gradient is more accurate 
with more steps. Third, we actually want models that have only a short range of interactions. 
Hence, regularly truncated backpropagation through time has a slight regularizing effect that can 
be desirable. 


8.7.2 Backpropagation Through Time in Detail 


After discussing the general principle, let us discuss backpropagation through time in detail.
Different from the analysis in Section 8.7.1, in the following we will show how to compute the
gradients of the objective function with respect to all the decomposed model parameters. To keep
things simple, we consider an RNN without bias parameters, whose activation function in the
hidden layer uses the identity mapping ($\phi(x) = x$). For time step $t$, let the single example
input and the label be $\mathbf{x}_t \in \mathbb{R}^d$ and $y_t$, respectively. The hidden state
$\mathbf{h}_t \in \mathbb{R}^h$ and the output $\mathbf{o}_t \in \mathbb{R}^q$ are computed as


$$\begin{aligned}
\mathbf{h}_t &= \mathbf{W}_{hx} \mathbf{x}_t + \mathbf{W}_{hh} \mathbf{h}_{t-1},\\
\mathbf{o}_t &= \mathbf{W}_{qh} \mathbf{h}_t,
\end{aligned}$$ (8.7.9)


where $\mathbf{W}_{hx} \in \mathbb{R}^{h \times d}$, $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$,
and $\mathbf{W}_{qh} \in \mathbb{R}^{q \times h}$ are the weight parameters. Denote by
$l(\mathbf{o}_t, y_t)$ the loss at time step $t$. Our objective function, the loss over $T$ time
steps from the beginning of the sequence, is thus


$$L = \frac{1}{T} \sum_{t=1}^T l(\mathbf{o}_t, y_t).$$ (8.7.10)


In order to visualize the dependencies among model variables and parameters during computation of
the RNN, we can draw a computational graph for the model, as shown in Fig. 8.7.2. For example, the
computation of the hidden states of time step 3, $\mathbf{h}_3$, depends on the model parameters
$\mathbf{W}_{hx}$ and $\mathbf{W}_{hh}$, the hidden state of the last time step $\mathbf{h}_2$, and
the input of the current time step $\mathbf{x}_3$.


Fig. 8.7.2: Computational graph showing dependencies for an RNN model with three time steps.
Boxes represent variables (not shaded) or parameters (shaded) and circles represent operators.


As just mentioned, the model parameters in Fig. 8.7.2 are $\mathbf{W}_{hx}$, $\mathbf{W}_{hh}$, and
$\mathbf{W}_{qh}$. Generally, training this model requires gradient computation with respect to
these parameters: $\partial L / \partial \mathbf{W}_{hx}$, $\partial L / \partial \mathbf{W}_{hh}$,
and $\partial L / \partial \mathbf{W}_{qh}$. According to the dependencies in Fig. 8.7.2, we can
traverse in the opposite direction of the arrows to calculate and store the gradients in turn. To
flexibly express the multiplication of matrices, vectors, and scalars of different shapes in the
chain rule, we continue to use the prod operator as described in Section 4.7.


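First of all, differentiating the objective function with respect to the model output at any time
step $t$ is fairly straightforward:


$$\frac{\partial L}{\partial \mathbf{o}_t}
= \frac{\partial l(\mathbf{o}_t, y_t)}{T \cdot \partial \mathbf{o}_t} \in \mathbb{R}^q.$$ (8.7.11)


Now, we can calculate the gradient of the objective function with respect to the parameter
$\mathbf{W}_{qh}$ in the output layer: $\partial L / \partial \mathbf{W}_{qh} \in \mathbb{R}^{q \times h}$.
Based on Fig. 8.7.2, the objective function $L$ depends on $\mathbf{W}_{qh}$ via
$\mathbf{o}_1, \ldots, \mathbf{o}_T$. Using the chain rule yields


$$\frac{\partial L}{\partial \mathbf{W}_{qh}}
= \sum_{t=1}^T \text{prod}\left(\frac{\partial L}{\partial \mathbf{o}_t},
  \frac{\partial \mathbf{o}_t}{\partial \mathbf{W}_{qh}}\right)
= \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{o}_t} \mathbf{h}_t^\top,$$ (8.7.12)


where $\partial L / \partial \mathbf{o}_t$ is given by (8.7.11).


Next, as shown in Fig. 8.7.2, at the final time step $T$ the objective function $L$ depends on the
hidden state $\mathbf{h}_T$ only via $\mathbf{o}_T$. Therefore, we can easily find the gradient
$\partial L / \partial \mathbf{h}_T \in \mathbb{R}^h$ using the chain rule:


$$\frac{\partial L}{\partial \mathbf{h}_T}
= \text{prod}\left(\frac{\partial L}{\partial \mathbf{o}_T},
  \frac{\partial \mathbf{o}_T}{\partial \mathbf{h}_T}\right)
= \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_T}.$$ (8.7.13)


It gets trickier for any time step $t < T$, where the objective function $L$ depends on
$\mathbf{h}_t$ via $\mathbf{h}_{t+1}$ and $\mathbf{o}_t$. According to the chain rule, the gradient
of the hidden state $\partial L / \partial \mathbf{h}_t \in \mathbb{R}^h$ at any time step $t < T$
can be recurrently computed as:


$$\frac{\partial L}{\partial \mathbf{h}_t}
= \text{prod}\left(\frac{\partial L}{\partial \mathbf{h}_{t+1}},
  \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_t}\right)
+ \text{prod}\left(\frac{\partial L}{\partial \mathbf{o}_t},
  \frac{\partial \mathbf{o}_t}{\partial \mathbf{h}_t}\right)
= \mathbf{W}_{hh}^\top \frac{\partial L}{\partial \mathbf{h}_{t+1}}
+ \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_t}.$$ (8.7.14)


For analysis, expanding the recurrent computation for any time step $1 \leq t \leq T$ gives


$$\frac{\partial L}{\partial \mathbf{h}_t}
= \sum_{i=t}^T \left(\mathbf{W}_{hh}^\top\right)^{T-i} \mathbf{W}_{qh}^\top
  \frac{\partial L}{\partial \mathbf{o}_{T+t-i}}.$$ (8.7.15)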


We can see from (8.7.15) that this simple linear example already exhibits some key problems of
long sequence models: it involves potentially very large powers of $\mathbf{W}_{hh}^\top$. In it,
eigenvalues smaller than 1 vanish and eigenvalues larger than 1 diverge. This is numerically
unstable, which manifests itself in the form of vanishing and exploding gradients. One way to
address this is to truncate the time steps at a computationally convenient size as discussed in
Section 8.7.1. In practice, this truncation is effected by detaching the gradient after a given
number of time steps. Later on we will see how more sophisticated sequence models such as long
short-term memory can alleviate this further.
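
A tiny numeric sketch makes the vanishing and exploding behavior tangible. It assumes, purely for illustration, a diagonal matrix with eigenvalues 0.9 and 1.1 standing in for $\mathbf{W}_{hh}^\top$, so that repeated multiplication acts on each coordinate separately.

```python
def apply_diagonal(diag, vec):
    """Multiply a vector by a diagonal matrix given by its diagonal entries."""
    return [d * v for d, v in zip(diag, vec)]


diag = [0.9, 1.1]  # eigenvalues of a hypothetical diagonal matrix
grad = [1.0, 1.0]  # stand-in for a backpropagated gradient signal

for step in range(100):
    grad = apply_diagonal(diag, grad)

print(grad)  # first entry is about 2.7e-05, second is about 1.4e+04
```

After 100 steps the two coordinates differ by more than eight orders of magnitude, even though both eigenvalues are within 10% of 1.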


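Finally, Fig. 8.7.2 shows that the objective function $L$ depends on model parameters
$\mathbf{W}_{hx}$ and $\mathbf{W}_{hh}$ in the hidden layer via hidden states
$\mathbf{h}_1, \ldots, \mathbf{h}_T$. To compute gradients with respect to such parameters
$\partial L / \partial \mathbf{W}_{hx} \in \mathbb{R}^{h \times d}$ and
$\partial L / \partial \mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$, we apply the chain rule that
gives


$$\begin{aligned}
\frac{\partial L}{\partial \mathbf{W}_{hx}}
&= \sum_{t=1}^T \text{prod}\left(\frac{\partial L}{\partial \mathbf{h}_t},
   \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_{hx}}\right)
 = \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{h}_t} \mathbf{x}_t^\top,\\
\frac{\partial L}{\partial \mathbf{W}_{hh}}
&= \sum_{t=1}^T \text{prod}\left(\frac{\partial L}{\partial \mathbf{h}_t},
   \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_{hh}}\right)
 = \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{h}_t} \mathbf{h}_{t-1}^\top,
\end{aligned}$$ (8.7.16)


where $\partial L / \partial \mathbf{h}_t$ that is recurrently computed by (8.7.13) and (8.7.14) is
the key quantity that affects the numerical stability.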


Since backpropagation through time is the application of backpropagation in RNNs, as we have
explained in Section 4.7, training RNNs alternates forward propagation with backpropagation
through time. Besides, backpropagation through time computes and stores the above gradients
in turn. Specifically, stored intermediate values are reused to avoid duplicate calculations, such
as storing $\partial L / \partial \mathbf{h}_t$ to be used in computation of both
$\partial L / \partial \mathbf{W}_{hx}$ and $\partial L / \partial \mathbf{W}_{hh}$.
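
The recursion can be verified on a scalar version of the linear RNN in (8.7.9). The sketch below assumes a squared loss $l(o_t, y_t) = \frac{1}{2}(o_t - y_t)^2$ and an initial hidden state of zero; the text leaves $l$ generic, and all parameter and data values are made up. It computes the gradient with respect to the recurrent weight via (8.7.11), (8.7.13), (8.7.14), and (8.7.16) and compares it with a central finite difference.

```python
def forward(w_hx, w_hh, w_qh, xs, ys, h0=0.0):
    """Return the loss L of (8.7.10) and the hidden states h_0, ..., h_T."""
    hs, loss = [h0], 0.0
    for x, y in zip(xs, ys):
        hs.append(w_hx * x + w_hh * hs[-1])      # (8.7.9), scalar case
        loss += 0.5 * (w_qh * hs[-1] - y) ** 2   # assumed squared loss
    return loss / len(xs), hs


def bptt_grad_w_hh(w_hx, w_hh, w_qh, xs, ys):
    """dL/dw_hh via (8.7.11), (8.7.13), (8.7.14), and (8.7.16)."""
    T = len(xs)
    _, hs = forward(w_hx, w_hh, w_qh, xs, ys)
    # (8.7.11): dL/do_t for the squared loss, stored 1-indexed.
    dL_do = [None] + [(w_qh * hs[t] - ys[t - 1]) / T for t in range(1, T + 1)]
    dL_dh_next, grad = 0.0, 0.0
    for t in range(T, 0, -1):
        # (8.7.13) when t = T (dL_dh_next is 0), (8.7.14) when t < T.
        dL_dh = w_hh * dL_dh_next + w_qh * dL_do[t]
        grad += dL_dh * hs[t - 1]                # (8.7.16), scalar case
        dL_dh_next = dL_dh                       # cached for the next step
    return grad


xs, ys = [1.0, -0.5, 2.0, 0.3], [0.2, 0.1, -0.4, 0.7]
w_hx, w_hh, w_qh = 0.5, 0.8, 1.5

analytic = bptt_grad_w_hh(w_hx, w_hh, w_qh, xs, ys)
eps = 1e-6
numeric = (forward(w_hx, w_hh + eps, w_qh, xs, ys)[0]
           - forward(w_hx, w_hh - eps, w_qh, xs, ys)[0]) / (2 * eps)
print(analytic, numeric)
```

The two numbers agree closely. The loop keeps only the most recent value of the hidden-state gradient and reuses it at the next (earlier) time step, which is the caching of intermediate values described above.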


Summary 


* Backpropagation through time is merely an application of backpropagation to sequence
models with a hidden state.


* Truncation is needed for computational convenience and numerical stability, such as regular
truncation and randomized truncation.


* High powers of matrices can lead to divergent or vanishing eigenvalues. This manifests itself
in the form of exploding or vanishing gradients.


* For efficient computation, intermediate values are cached during backpropagation through
time.


Exercises 


1. Assume that we have a symmetric matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$ with eigenvalues
$\lambda_i$ whose corresponding eigenvectors are $\mathbf{v}_i$ ($i = 1, \ldots, n$). Without loss
of generality, assume that they are ordered in the order $|\lambda_i| \geq |\lambda_{i+1}|$.


1. Show that $\mathbf{M}^k$ has eigenvalues $\lambda_i^k$.


2. Prove that for a random vector $\mathbf{x} \in \mathbb{R}^n$, with high probability
$\mathbf{M}^k \mathbf{x}$ will be very much aligned with the eigenvector $\mathbf{v}_1$ of
$\mathbf{M}$. Formalize this statement.


3. What does the above result mean for gradients in RNNs? 
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2. Besides gradient clipping, can you think of any other methods to cope with gradient explo- 
sion in recurrent neural networks? 


Discussions [11]


[11] https://discuss.d2l.ai/t/334
9 Modern Recurrent Neural Networks


We have introduced the basics of RNNs, which can better handle sequence data. For demonstra- 
tion, we implemented RNN-based language models on text data. However, such techniques may 
not be sufficient for practitioners when they face a wide range of sequence learning problems 
nowadays. 


For instance, a notable issue in practice is the numerical instability of RNNs. Although we have
applied implementation tricks such as gradient clipping, this issue can be alleviated further with
more sophisticated designs of sequence models. Specifically, gated RNNs are much more common
in practice. We will begin by introducing two such widely used networks, namely gated
recurrent units (GRUs) and long short-term memory (LSTM). Furthermore, we will expand the RNN
architecture with a single unidirectional hidden layer that has been discussed so far. We will
describe deep architectures with multiple hidden layers, and discuss the bidirectional design with
both forward and backward recurrent computations. Such expansions are frequently adopted in
modern recurrent networks. When explaining these RNN variants, we continue to consider the
same language modeling problem introduced in Chapter 8.


In fact, language modeling reveals only a small fraction of what sequence learning is capable of. 
In a variety of sequence learning problems, such as automatic speech recognition, text to speech, 
and machine translation, both inputs and outputs are sequences of arbitrary length. To explain 
how to fit this type of data, we will take machine translation as an example, and introduce the 
encoder-decoder architecture based on RNNs and beam search for sequence generation. 


9.1 Gated Recurrent Units (GRU) 


In Section 8.7, we discussed how gradients are calculated in RNNs. In particular we found that 
long products of matrices can lead to vanishing or exploding gradients. Let us briefly think about 
what such gradient anomalies mean in practice: 


• We might encounter a situation where an early observation is highly significant for predicting
  all future observations. Consider the somewhat contrived case where the first observation
  contains a checksum and the goal is to discern whether the checksum is correct at the
  end of the sequence. In this case, the influence of the first token is vital. We would like to
  have some mechanisms for storing vital early information in a memory cell. Without such a
  mechanism, we will have to assign a very large gradient to this observation, since it affects
  all the subsequent observations.

• We might encounter situations where some tokens carry no pertinent observation. For instance,
  when parsing a web page there might be auxiliary HTML code that is irrelevant for
  the purpose of assessing the sentiment conveyed on the page. We would like to have some
  mechanism for skipping such tokens in the latent state representation.

• We might encounter situations where there is a logical break between parts of a sequence.
  For instance, there might be a transition between chapters in a book, or a transition between
  a bear and a bull market for securities. In this case it would be nice to have a means of
  resetting our internal state representation.


A number of methods have been proposed to address this. One of the earliest is long short-term
memory (Hochreiter & Schmidhuber, 1997), which we will discuss in Section 9.2. The gated recurrent
unit (GRU) (Cho et al., 2014a) is a slightly more streamlined variant that often offers comparable
performance and is significantly faster to compute (Chung et al., 2014). Due to its simplicity,
let us start with the GRU.


9.1.1 Gated Hidden State 


The key distinction between vanilla RNNs and GRUs is that the latter support gating of the hidden 
state. This means that we have dedicated mechanisms for when a hidden state should be updated 
and also when it should be reset. These mechanisms are learned and they address the concerns 
listed above. For instance, if the first token is of great importance we will learn not to update 
the hidden state after the first observation. Likewise, we will learn to skip irrelevant temporary 
observations. Last, we will learn to reset the latent state whenever needed. We discuss this in 
detail below. 


Reset Gate and Update Gate 


The first thing we need to introduce are the reset gate and the update gate. We engineer them to be 
vectors with entries in (0, 1) such that we can perform convex combinations. For instance, a reset 
gate would allow us to control how much of the previous state we might still want to remember. 
Likewise, an update gate would allow us to control how much of the new state is just a copy of the 
old state. 


We begin by engineering these gates. Fig. 9.1.1 illustrates the inputs for both the reset and update 
gates in a GRU, given the input of the current time step and the hidden state of the previous time 
step. The outputs of two gates are given by two fully-connected layers with a sigmoid activation 
function. 
Fig. 9.1.1: Computing the reset gate and the update gate in a GRU model.


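Mathematically, for a given time step $t$, suppose that the input is a minibatch $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of
examples: $n$, number of inputs: $d$) and the hidden state of the previous time step is $\mathbf{H}_{t-1} \in \mathbb{R}^{n \times h}$
(number of hidden units: $h$). Then, the reset gate $\mathbf{R}_t \in \mathbb{R}^{n \times h}$ and update gate $\mathbf{Z}_t \in \mathbb{R}^{n \times h}$ are
computed as follows:

$$
\begin{aligned}
\mathbf{R}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r), \\
\mathbf{Z}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z),
\end{aligned}
\tag{9.1.1}
$$

where $\mathbf{W}_{xr}, \mathbf{W}_{xz} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hr}, \mathbf{W}_{hz} \in \mathbb{R}^{h \times h}$ are weight parameters and $\mathbf{b}_r, \mathbf{b}_z \in \mathbb{R}^{1 \times h}$ are
biases. Note that broadcasting (see Section 2.1.3) is triggered during the summation. We use
sigmoid functions (as introduced in Section 4.1) to transform input values to the interval (0, 1).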
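
As a minimal sketch of (9.1.1), the snippet below computes both gates with plain NumPy rather than MXNet. The sizes, the random inputs, and the helper `sigmoid` are assumptions chosen only to make the shapes and the broadcasting of the biases visible.

```python
import numpy as np

def sigmoid(x):
    # Logistic function; maps any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, d, h = 2, 5, 4  # batch size, number of inputs, number of hidden units

X_t = rng.normal(size=(n, d))     # minibatch input at time step t
H_prev = rng.normal(size=(n, h))  # hidden state of time step t - 1

# Gate parameters: small Gaussian weights and zero biases of shape (1, h).
W_xr = rng.normal(scale=0.01, size=(d, h))
W_hr = rng.normal(scale=0.01, size=(h, h))
b_r = np.zeros((1, h))
W_xz = rng.normal(scale=0.01, size=(d, h))
W_hz = rng.normal(scale=0.01, size=(h, h))
b_z = np.zeros((1, h))

# The (1, h) biases are broadcast across the n rows of the minibatch.
R_t = sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)
Z_t = sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)

print(R_t.shape, Z_t.shape)
```

Both gates have the same shape as the hidden state, so they can later be combined with it elementwise.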


Candidate Hidden State 


Next, let us integrate the reset gate $\mathbf{R}_t$ with the regular latent state updating mechanism in (8.4.5).
It leads to the following candidate hidden state $\tilde{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$ at time step $t$:

$$\tilde{\mathbf{H}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + (\mathbf{R}_t \odot \mathbf{H}_{t-1}) \mathbf{W}_{hh} + \mathbf{b}_h), \tag{9.1.2}$$

where $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ are weight parameters, $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$ is the bias, and the symbol
$\odot$ is the Hadamard (elementwise) product operator. Here we use a nonlinearity in the form of
tanh to ensure that the values in the candidate hidden state remain in the interval (−1, 1).

The result is a candidate since we still need to incorporate the action of the update gate. Comparing
with (8.4.5), now the influence of the previous states can be reduced with the elementwise
multiplication of $\mathbf{R}_t$ and $\mathbf{H}_{t-1}$ in (9.1.2). Whenever the entries in the reset gate $\mathbf{R}_t$ are close to 1,
we recover a vanilla RNN such as in (8.4.5). For all entries of the reset gate $\mathbf{R}_t$ that are close to 0,
the candidate hidden state is the result of an MLP with $\mathbf{X}_t$ as the input. Any pre-existing hidden
state is thus reset to defaults.
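
The two extreme cases can be checked numerically. The NumPy sketch below is an illustration only: the sizes and random values are assumptions, and the vanilla RNN update is taken to use tanh as its activation function.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 2, 5, 4  # illustrative sizes
X_t = rng.normal(size=(n, d))
H_prev = rng.normal(size=(n, h))
W_xh = rng.normal(size=(d, h))
W_hh = rng.normal(size=(h, h))
b_h = np.zeros((1, h))

def candidate(R_t):
    """Candidate hidden state of (9.1.2) for a given reset gate R_t."""
    return np.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)

# Reference computations for the two limits discussed in the text.
vanilla_rnn = np.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)  # uses H_prev fully
input_only = np.tanh(X_t @ W_xh + b_h)                   # ignores H_prev

all_open = candidate(np.ones((n, h)))     # reset gate entries equal to 1
all_closed = candidate(np.zeros((n, h)))  # reset gate entries equal to 0

print(np.allclose(all_open, vanilla_rnn), np.allclose(all_closed, input_only))
```

In practice the sigmoid never outputs exactly 0 or 1, so a trained GRU interpolates smoothly between these two behaviors.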


Fig. 9.1.2 illustrates the computational flow after applying the reset gate. 
Fig. 9.1.2: Computing the candidate hidden state in a GRU model.


Hidden State 


Finally, we need to incorporate the effect of the update gate $\mathbf{Z}_t$. This determines the extent to which
the new hidden state $\mathbf{H}_t \in \mathbb{R}^{n \times h}$ is just the old state $\mathbf{H}_{t-1}$ and by how much the new candidate state
$\tilde{\mathbf{H}}_t$ is used. The update gate $\mathbf{Z}_t$ can be used for this purpose, simply by taking elementwise convex
combinations between both $\mathbf{H}_{t-1}$ and $\tilde{\mathbf{H}}_t$. This leads to the final update equation for the GRU:

$$\mathbf{H}_t = \mathbf{Z}_t \odot \mathbf{H}_{t-1} + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t. \tag{9.1.3}$$

Whenever the update gate $\mathbf{Z}_t$ is close to 1, we simply retain the old state. In this case the information
from $\mathbf{X}_t$ is essentially ignored, effectively skipping time step $t$ in the dependency chain. In
contrast, whenever $\mathbf{Z}_t$ is close to 0, the new latent state $\mathbf{H}_t$ approaches the candidate latent state
$\tilde{\mathbf{H}}_t$. These designs can help us cope with the vanishing gradient problem in RNNs and better capture
dependencies for sequences with large time step distances. For instance, if the update gate
has been close to 1 for all the time steps of an entire subsequence, the old hidden state at the time
step of its beginning will be easily retained and passed to its end, regardless of the length of the
subsequence.
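
A small numerical sketch of (9.1.3) may help. The values of the old state and of the candidate state below are arbitrary assumptions, and `gru_update` is a hypothetical helper name used only here.

```python
import numpy as np

H_prev = np.array([[0.5, -0.2, 0.9]])   # old hidden state
H_tilde = np.array([[-0.7, 0.4, 0.1]])  # candidate hidden state

def gru_update(Z_t):
    """Elementwise convex combination of the old and candidate states."""
    return Z_t * H_prev + (1 - Z_t) * H_tilde

keep = gru_update(np.ones_like(H_prev))         # Z = 1: retain the old state
replace = gru_update(np.zeros_like(H_prev))     # Z = 0: adopt the candidate
mixed = gru_update(np.full_like(H_prev, 0.25))  # Z = 0.25: mostly candidate

print(keep, replace, mixed)
```

Because each entry of the new state is a convex combination of two values in (−1, 1), it stays in that interval no matter what the gate outputs.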


Fig. 9.1.3 illustrates the computational flow after the update gate is in action. 
Fig. 9.1.3: Computing the hidden state in a GRU model.


In summary, GRUs have the following two distinguishing features: 
• Reset gates help capture short-term dependencies in sequences.

• Update gates help capture long-term dependencies in sequences.


9.1.2 Implementation from Scratch 


To gain a better understanding of the GRU model, let us implement it from scratch. We begin by 
reading the time machine dataset that we used in Section 8.5. The code for reading the dataset is 
given below. 


from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import rnn
npx.set_np()

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)


Initializing Model Parameters 


The next step is to initialize the model parameters. We draw the weights from a Gaussian distri- 
bution with standard deviation to be 0.01 and set the bias to 0. The hyperparameter num_hiddens 
defines the number of hidden units. We instantiate all weights and biases relating to the update 
gate, the reset gate, the candidate hidden state, and the output layer. 


def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return np.random.normal(scale=0.01, size=shape, ctx=device)

    def three():
        return (normal((num_inputs, num_hiddens)),
                normal((num_hiddens, num_hiddens)),
                np.zeros(num_hiddens, ctx=device))

    W_xz, W_hz, b_z = three()  # Update gate parameters
    W_xr, W_hr, b_r = three()  # Reset gate parameters
    W_xh, W_hh, b_h = three()  # Candidate hidden state parameters
    # Output layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = np.zeros(num_outputs, ctx=device)
    # Attach gradients
    params = [W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.attach_grad()
    return params





Defining the Model 


Now we will define the hidden state initialization function init_gru_state. Just like the 
init_rnn_state function defined in Section 8.5, this function returns a tensor with a shape (batch 
size, number of hidden units) whose values are all zeros. 


def init_gru_state(batch_size, num_hiddens, device):
    return (np.zeros(shape=(batch_size, num_hiddens), ctx=device), )


Now we are ready to define the GRU model. Its structure is the same as that of the basic RNN cell, 
except that the update equations are more complex. 


def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        # Update gate and reset gate, as in (9.1.1)
        Z = npx.sigmoid(np.dot(X, W_xz) + np.dot(H, W_hz) + b_z)
        R = npx.sigmoid(np.dot(X, W_xr) + np.dot(H, W_hr) + b_r)
        # Candidate hidden state (9.1.2) and new hidden state (9.1.3)
        H_tilda = np.tanh(np.dot(X, W_xh) + np.dot(R * H, W_hh) + b_h)
        H = Z * H + (1 - Z) * H_tilda
        Y = np.dot(H, W_hq) + b_q
        outputs.append(Y)
    return np.concatenate(outputs, axis=0), (H,)
Training and Prediction


Training and prediction work in exactly the same manner as in Section 8.5. After training, we print
out the perplexity on the training set and the predicted sequence following the provided prefixes 
“time traveller” and “traveller”, respectively. 


vocab_size, num_hiddens, device = len(vocab), 256, d2l.try_gpu()
num_epochs, lr = 500, 1
model = d2l.RNNModelScratch(len(vocab), num_hiddens, device, get_params,
                            init_gru_state, gru)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)


perplexity 1.1, 11297.2 tokens/sec on gpu(0) 
time travelleryou can show black is white by argument said filby 
travelleryou can show black is white by argument said filby 
9.1.3 Concise Implementation


In high-level APIs, we can directly instantiate a GRU model. This encapsulates all the configuration
details that we made explicit above. The code is significantly faster as it uses compiled operators
rather than Python for many details that we spelled out before.


gru_layer = rnn.GRU(num_hiddens)
model = d2l.RNNModel(gru_layer, len(vocab))
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)


perplexity 1.1, 155843.2 tokens/sec on gpu(0) 
time travelleryou can show black is white by argument said filby 
travelleryou can show black is white by argument said filby 
Summary


• Gated RNNs can better capture dependencies for sequences with large time step distances.

• Reset gates help capture short-term dependencies in sequences.

• Update gates help capture long-term dependencies in sequences.

• GRUs contain basic RNNs as their extreme case whenever the reset gate is switched on. They
  can also skip subsequences by turning on the update gate.


Exercises 


1. Assume that we only want to use the input at time step $t'$ to predict the output at time step
   $t > t'$. What are the best values for the reset and update gates for each time step?


2. Adjust the hyperparameters and analyze their influence on running time, perplexity, and
   the output sequence.


3. Compare runtime, perplexity, and the output strings for rnn.RNN and rnn.GRU implementa- 
tions with each other. 


4. What happens if you implement only parts of a GRU, e.g., with only a reset gate or only an 
update gate? 


Discussions [12]


9.2 Long Short-Term Memory (LSTM) 


The challenge to address long-term information preservation and short-term input skipping in 
latent variable models has existed for a long time. One of the earliest approaches to address this 
was the long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997). It shares many of 
the properties of the GRU. Interestingly, LSTMs have a slightly more complex design than GRUs
but predate GRUs by almost two decades.





[12] https://discuss.d2l.ai/t/342
9.2.1 Gated Memory Cell


Arguably LSTM’s design is inspired by logic gates of a computer. LSTM introduces a memory cell (or
cell for short) that has the same shape as the hidden state (some literature considers the memory
cell as a special type of the hidden state), engineered to record additional information. To control
the memory cell we need a number of gates. One gate is needed to read out the entries from the 
cell. We will refer to this as the output gate. A second gate is needed to decide when to read data 
into the cell. We refer to this as the input gate. Last, we need a mechanism to reset the content of 
the cell, governed by a forget gate. The motivation for such a design is the same as that of GRUs, 
namely to be able to decide when to remember and when to ignore inputs in the hidden state via 
a dedicated mechanism. Let us see how this works in practice. 


Input Gate, Forget Gate, and Output Gate 


Just like in GRUs, the data feeding into the LSTM gates are the input at the current time step and the
hidden state of the previous time step, as illustrated in Fig. 9.2.1. They are processed by three
fully-connected layers with a sigmoid activation function to compute the values of the input, forget, and
output gates. As a result, values of the three gates are in the range of (0, 1).


Fig. 9.2.1: Computing the input gate, the forget gate, and the output gate in an LSTM model.


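Mathematically, suppose that there are $h$ hidden units, the batch size is $n$, and the number of
inputs is $d$. Thus, the input is $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ and the hidden state of the previous time step is
$\mathbf{H}_{t-1} \in \mathbb{R}^{n \times h}$. Correspondingly, the gates at time step $t$ are defined as follows: the input gate is $\mathbf{I}_t \in \mathbb{R}^{n \times h}$,
the forget gate is $\mathbf{F}_t \in \mathbb{R}^{n \times h}$, and the output gate is $\mathbf{O}_t \in \mathbb{R}^{n \times h}$. They are calculated as follows:

$$
\begin{aligned}
\mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i), \\
\mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f), \\
\mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o),
\end{aligned}
\tag{9.2.1}
$$

where $\mathbf{W}_{xi}, \mathbf{W}_{xf}, \mathbf{W}_{xo} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hi}, \mathbf{W}_{hf}, \mathbf{W}_{ho} \in \mathbb{R}^{h \times h}$ are weight parameters and $\mathbf{b}_i, \mathbf{b}_f, \mathbf{b}_o \in \mathbb{R}^{1 \times h}$
are bias parameters.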
Candidate Memory Cell


Next we design the memory cell. Since we have not specified the action of the various gates yet, we
first introduce the candidate memory cell $\tilde{\mathbf{C}}_t \in \mathbb{R}^{n \times h}$. Its computation is similar to that of the three
gates described above, but using a tanh function with a value range of (−1, 1) as the activation
function. This leads to the following equation at time step $t$:

$$\tilde{\mathbf{C}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c), \tag{9.2.2}$$

where $\mathbf{W}_{xc} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hc} \in \mathbb{R}^{h \times h}$ are weight parameters and $\mathbf{b}_c \in \mathbb{R}^{1 \times h}$ is a bias parameter.


A quick illustration of the candidate memory cell is shown in Fig. 9.2.2. 
Fig. 9.2.2: Computing the candidate memory cell in an LSTM model.


Memory Cell 


In GRUs, we have a mechanism to govern input and forgetting (or skipping). Similarly, in LSTMs
we have two dedicated gates for such purposes: the input gate $\mathbf{I}_t$ governs how much we take new
data into account via $\tilde{\mathbf{C}}_t$ and the forget gate $\mathbf{F}_t$ addresses how much of the old memory cell content
$\mathbf{C}_{t-1} \in \mathbb{R}^{n \times h}$ we retain. Using the same pointwise multiplication trick as before, we arrive at the
following update equation:

$$\mathbf{C}_t = \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t. \tag{9.2.3}$$

If the forget gate is always approximately 1 and the input gate is always approximately 0, the past
memory cells $\mathbf{C}_{t-1}$ will be saved over time and passed to the current time step. This design is
introduced to alleviate the vanishing gradient problem and to better capture long range dependencies
within sequences.
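
The behavior of (9.2.3) under extreme gate values can be sketched with a few lines of NumPy. The cell contents below are arbitrary assumptions, and `memory_update` is a hypothetical helper name. The sketch also shows one difference from the GRU: the forget and input gates are independent, so the memory cell is not a convex combination and its entries can leave the interval (−1, 1).

```python
import numpy as np

C_prev = np.array([[1.5, -2.0, 0.3]])   # old memory cell content
C_tilde = np.array([[0.9, 0.9, -0.9]])  # candidate memory cell, in (-1, 1)

def memory_update(F_t, I_t, C_old, C_cand):
    """Memory cell update of (9.2.3)."""
    return F_t * C_old + I_t * C_cand

ones, zeros = np.ones_like(C_prev), np.zeros_like(C_prev)

# Forget gate at 1, input gate at 0: the content survives many time steps.
C = C_prev
for _ in range(100):
    C = memory_update(ones, zeros, C, C_tilde)
preserved = C

# Forget gate at 0, input gate at 1: the content is overwritten.
overwritten = memory_update(zeros, ones, C_prev, C_tilde)

# Both gates at 1: old content and candidate add up, so entries may grow.
accumulated = memory_update(ones, ones, C_prev, C_tilde)

print(preserved, overwritten, accumulated)
```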


We thus arrive at the flow diagram in Fig. 9.2.3. 
Fig. 9.2.3: Computing the memory cell in an LSTM model.


Hidden State 


Last, we need to define how to compute the hidden state $\mathbf{H}_t \in \mathbb{R}^{n \times h}$. This is where the output gate
comes into play. In LSTM it is simply a gated version of the tanh of the memory cell. This ensures
that the values of $\mathbf{H}_t$ are always in the interval (−1, 1).

$$\mathbf{H}_t = \mathbf{O}_t \odot \tanh(\mathbf{C}_t). \tag{9.2.4}$$

Whenever the output gate approximates 1 we effectively pass all memory information through to
the predictor, whereas for the output gate close to 0 we retain all the information only within the
memory cell and perform no further processing.
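
Putting (9.2.1) to (9.2.4) together, one LSTM time step can be sketched in NumPy as follows. This is an illustration under assumptions rather than the implementation used later: the sizes, the weight scale of 0.1, the random inputs, and the names `sigmoid`, `lstm_step`, and `three` are all chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X_t, H_prev, C_prev, params):
    """One LSTM time step; returns the new hidden state and memory cell."""
    (W_xi, W_hi, b_i, W_xf, W_hf, b_f,
     W_xo, W_ho, b_o, W_xc, W_hc, b_c) = params
    I_t = sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)      # input gate
    F_t = sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)      # forget gate
    O_t = sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)      # output gate
    C_tilde = np.tanh(X_t @ W_xc + H_prev @ W_hc + b_c)  # candidate cell
    C_t = F_t * C_prev + I_t * C_tilde                   # memory cell
    H_t = O_t * np.tanh(C_t)                             # hidden state
    return H_t, C_t

rng = np.random.default_rng(0)
n, d, h = 3, 5, 4  # illustrative sizes

def three():
    # Input weights, recurrent weights, and bias for one gate or candidate.
    return (rng.normal(scale=0.1, size=(d, h)),
            rng.normal(scale=0.1, size=(h, h)),
            np.zeros((1, h)))

params = three() + three() + three() + three()

H = np.zeros((n, h))
C = np.zeros((n, h))
for _ in range(20):  # run a short random input sequence
    H, C = lstm_step(rng.normal(size=(n, d)), H, C, params)

print(H.shape, C.shape, np.abs(H).max())
```

Only `H` would be handed to an output layer; `C` is carried along to the next time step.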


Fig. 9.2.4 has a graphical illustration of the data flow. 





Fig. 9.2.4: Computing the hidden state in an LSTM model.
9.2.2 Implementation from Scratch


Now let us implement an LSTM from scratch. As in the experiments in Section 8.5, we first
load the time machine dataset.


from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import rnn
npx.set_np()

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)


Initializing Model Parameters 


Next we need to define and initialize the model parameters. As previously, the hyperparameter 
num_hiddens defines the number of hidden units. We initialize weights following a Gaussian dis- 
tribution with 0.01 standard deviation, and we set the biases to 0. 


def get_lstm_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return np.random.normal(scale=0.01, size=shape, ctx=device)

    def three():
        return (normal((num_inputs, num_hiddens)),
                normal((num_hiddens, num_hiddens)),
                np.zeros(num_hiddens, ctx=device))

    W_xi, W_hi, b_i = three()  # Input gate parameters
    W_xf, W_hf, b_f = three()  # Forget gate parameters
    W_xo, W_ho, b_o = three()  # Output gate parameters
    W_xc, W_hc, b_c = three()  # Candidate memory cell parameters
    # Output layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = np.zeros(num_outputs, ctx=device)
    # Attach gradients
    params = [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc,
              b_c, W_hq, b_q]
    for param in params:
        param.attach_grad()
    return params










Defining the Model 


In the initialization function, the hidden state of the LSTM needs to return an additional mem- 
ory cell with a value of 0 and a shape of (batch size, number of hidden units). Hence we get the 
following state initialization. 


def init_lstm_state(batch_size, num_hiddens, device):
    return (np.zeros((batch_size, num_hiddens), ctx=device),
            np.zeros((batch_size, num_hiddens), ctx=device))


The actual model is defined just like what we discussed before: providing three gates and an
auxiliary memory cell. Note that only the hidden state is passed to the output layer. The memory cell
$\mathbf{C}_t$ does not directly participate in the output computation.


def lstm(inputs, state, params):
    [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c,
     W_hq, b_q] = params
    (H, C) = state
    outputs = []
    for X in inputs:
        # Input, forget, and output gates, as in (9.2.1)
        I = npx.sigmoid(np.dot(X, W_xi) + np.dot(H, W_hi) + b_i)
        F = npx.sigmoid(np.dot(X, W_xf) + np.dot(H, W_hf) + b_f)
        O = npx.sigmoid(np.dot(X, W_xo) + np.dot(H, W_ho) + b_o)
        # Candidate memory cell (9.2.2), memory cell (9.2.3), hidden state (9.2.4)
        C_tilda = np.tanh(np.dot(X, W_xc) + np.dot(H, W_hc) + b_c)
        C = F * C + I * C_tilda
        H = O * np.tanh(C)
        Y = np.dot(H, W_hq) + b_q
        outputs.append(Y)
    return np.concatenate(outputs, axis=0), (H, C)





Training and Prediction 


Let us train an LSTM in the same way as we did in Section 9.1, by instantiating the RNNModelScratch
class as introduced in Section 8.5.


vocab_size, num_hiddens, device = len(vocab), 256, d2l.try_gpu()
num_epochs, lr = 500, 1
model = d2l.RNNModelScratch(len(vocab), num_hiddens, device, get_lstm_params,
                            init_lstm_state, lstm)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)


perplexity 1.4, 9964.1 tokens/sec on gpu(0) 
time traveller thind wist tracelures of the tron the megis not y 
traveller about the room and set of the grong sed to the ot 
9.2.3 Concise Implementation


Using high-level APIs, we can directly instantiate an LSTM model. This encapsulates all the con- 
figuration details that we made explicit above. The code is significantly faster as it uses compiled 
operators rather than Python for many details that we spelled out in detail before. 


lstm_layer = rnn.LSTM(num_hiddens)
model = d2l.RNNModel(lstm_layer, len(vocab))
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)


perplexity 1.2, 172569.4 tokens/sec on gpu(0) 
time traveller wat so it bechmarkon and shilly unaco right anlle 
travelleryou can show black is white by argument said filby 
LSTMs are the prototypical latent variable autoregressive model with nontrivial state control.
Many variants thereof have been proposed over the years, e.g., multiple layers, residual
connections, and different types of regularization. However, training LSTMs and other sequence models
(such as GRUs) is quite costly due to the long range dependency of the sequence. Later we will
encounter alternative models such as Transformers that can be used in some cases.
Summary


• LSTMs have three types of gates: input gates, forget gates, and output gates that control the
  flow of information.

• The hidden layer output of LSTM includes the hidden state and the memory cell. Only the
  hidden state is passed into the output layer. The memory cell is entirely internal.

• LSTMs can alleviate vanishing and exploding gradients.


Exercises 


1. Adjust the hyperparameters and analyze their influence on running time, perplexity, and
   the output sequence.


2. How would you need to change the model to generate proper words as opposed to sequences 
of characters? 


3. Compare the computational cost for GRUs, LSTMs, and regular RNNs for a given hidden 
dimension. Pay special attention to the training and inference cost. 


4. Since the candidate memory cell ensures that the value range is between —1 and 1 by using 
the tanh function, why does the hidden state need to use the tanh function again to ensure 
that the output value range is between —1 and 1? 


5. Implement an LSTM model for time series prediction rather than character sequence pre- 
diction. 


Discussions [13]


9.3 Deep Recurrent Neural Networks 


Up to now, we only discussed RNNs with a single unidirectional hidden layer. In it the specific 
functional form of how latent variables and observations interact is rather arbitrary. This is not a 
big problem as long as we have enough flexibility to model different types of interactions. With a 
single layer, however, this can be quite challenging. In the case of the linear models, we fixed this 
problem by adding more layers. Within RNNs this is a bit trickier, since we first need to decide 
how and where to add extra nonlinearity. 


In fact, we could stack multiple layers of RNNs on top of each other. This results in a flexible 
mechanism, due to the combination of several simple layers. In particular, data might be relevant 
at different levels of the stack. For instance, we might want to keep high-level data about financial 
market conditions (bear or bull market) available, whereas at a lower level we only record shorter- 
term temporal dynamics. 


Beyond all the above abstract discussion it is probably easiest to understand the family of models 
we are interested in by reviewing Fig. 9.3.1. It describes a deep RNN with L hidden layers. Each 
hidden state is continuously passed to both the next time step of the current layer and the current 
time step of the next layer.


[13] https://discuss.d2l.ai/t/343
Fig. 9.3.1: Architecture of a deep RNN.


9.3.1 Functional Dependencies 


We can formalize the functional dependencies within the deep architecture of L hidden layers 
depicted in Fig. 9.3.1. Our following discussion focuses primarily on the vanilla RNN model, but 
it applies to other sequence models, too. 


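Suppose that we have a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of examples: $n$, number of inputs
in each example: $d$) at time step $t$. At the same time step, let the hidden state of the $l^\mathrm{th}$ hidden
layer ($l = 1, \ldots, L$) be $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$ (number of hidden units: $h$) and the output layer variable be
$\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$). Setting $\mathbf{H}_t^{(0)} = \mathbf{X}_t$, the hidden state of the $l^\mathrm{th}$ hidden layer that
uses the activation function $\phi_l$ is expressed as follows:

$$\mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)} + \mathbf{b}_h^{(l)}), \tag{9.3.1}$$

where the weights $\mathbf{W}_{xh}^{(l)} \in \mathbb{R}^{h \times h}$ (or $\mathbb{R}^{d \times h}$ for the first layer, whose input is $\mathbf{X}_t$) and $\mathbf{W}_{hh}^{(l)} \in \mathbb{R}^{h \times h}$,
together with the bias $\mathbf{b}_h^{(l)} \in \mathbb{R}^{1 \times h}$, are the model parameters of the $l^\mathrm{th}$ hidden layer.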


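In the end, the calculation of the output layer is only based on the hidden state of the final $L^\mathrm{th}$
hidden layer:

$$\mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q, \tag{9.3.2}$$

where the weight $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the model parameters of the output
layer.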


Just as with MLPs, the number of hidden layers L and the number of hidden units h are hyperpa- 
rameters. In other words, they can be tuned or specified by us. In addition, we can easily get a 
deep gated RNN by replacing the hidden state computation in (9.3.1) with that from a GRU or an 
LSTM. 
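
The functional dependencies in (9.3.1) and (9.3.2) can be sketched as a pair of nested loops over time steps and layers. The NumPy code below is a minimal illustration under assumptions: every layer uses tanh as its activation function, the sizes and random inputs are arbitrary, and the variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, q = 2, 5, 4, 3  # batch size, inputs, hidden units, outputs
L, T = 3, 6              # number of hidden layers, number of time steps

# The first layer reads d input features; every later layer reads h features.
layers = []
for l in range(L):
    in_dim = d if l == 0 else h
    layers.append((rng.normal(scale=0.1, size=(in_dim, h)),  # W_xh of layer l
                   rng.normal(scale=0.1, size=(h, h)),       # W_hh of layer l
                   np.zeros((1, h))))                        # b_h of layer l
W_hq = rng.normal(scale=0.1, size=(h, q))
b_q = np.zeros((1, q))

H = [np.zeros((n, h)) for _ in range(L)]  # one hidden state per layer
outputs = []
for t in range(T):
    layer_input = rng.normal(size=(n, d))  # X_t, playing the role of H_t^(0)
    for l, (W_xh, W_hh, b_h) in enumerate(layers):
        # H[l] on the right-hand side still holds the state of time step t - 1.
        H[l] = np.tanh(layer_input @ W_xh + H[l] @ W_hh + b_h)
        layer_input = H[l]  # passed up to the next layer at the same time step
    outputs.append(H[-1] @ W_hq + b_q)  # output uses the last layer only

print(len(outputs), outputs[0].shape)
```

Replacing the tanh update with a GRU or LSTM step would give a deep gated RNN with the same loop structure.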
9.3.2 Concise Implementation


Fortunately many of the logistical details required to implement multiple layers of an RNN are 
readily available in high-level APIs. To keep things simple we only illustrate the implementation 
using such built-in functionalities. Let us take an LSTM model as an example. The code is very 
similar to the one we used previously in Section 9.2. In fact, the only difference is that we specify 
the number of layers explicitly rather than picking the default of a single layer. As usual, we begin 
by loading the dataset. 


from d2l import mxnet as d2l
from mxnet import npx
from mxnet.gluon import rnn
npx.set_np()

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)


The architectural decisions such as choosing hyperparameters are very similar to those of Section 
9.2. We pick the same number of inputs and outputs as we have distinct tokens, i.e., vocab_size. 
The number of hidden units is still 256. The only difference is that we now select a nontrivial 
number of hidden layers by specifying the value of num_layers. 


vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
device = d2l.try_gpu()
lstm_layer = rnn.LSTM(num_hiddens, num_layers)
model = d2l.RNNModel(lstm_layer, len(vocab))


9.3.3 Training and Prediction 


Since now we instantiate two layers with the LSTM model, this rather more complex architecture 
slows down training considerably. 


num_epochs, lr = 500, 2
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)


perplexity 1.0, 118064.4 tokens/sec on gpu(0) 
time travelleryou can show black is white by argument said filby 
travelleryou can show black is white by argument said filby 
Summary


• In deep RNNs, the hidden state information is passed to the next time step of the current
  layer and the current time step of the next layer.

• There exist many different flavors of deep RNNs, such as LSTMs, GRUs, or vanilla RNNs.
  Conveniently these models are all available as parts of the high-level APIs of deep learning
  frameworks.

• Initialization of models requires care. Overall, deep RNNs require a considerable amount of
  work (such as tuning the learning rate and clipping) to ensure proper convergence.


Exercises 
1. Try to implement a two-layer RNN from scratch using the single layer implementation we 
discussed in Section 8.5. 
2. Replace the LSTM by a GRU and compare the accuracy and training speed. 


3. Increase the training data to include multiple books. How low can you go on the perplexity 
scale? 


4. Would you want to combine sources of different authors when modeling text? Why is this a 
good idea? What could go wrong? 


Discussions [14]


[14] https://discuss.d2l.ai/t/340







9.4 Bidirectional Recurrent Neural Networks 


In sequence learning, so far we assumed that our goal is to model the next output given what we 
have seen so far, e.g., in the context of a time series or in the context of a language model. While 
this is a typical scenario, it is not the only one we might encounter. To illustrate the issue, consider 
the following three tasks of filling in the blank in a text sequence: 


• I am ___.

• I am ___ hungry.

• I am ___ hungry, and I can eat half a pig.


Depending on the amount of information available, we might fill in the blanks with very different
words such as “happy”, “not”, and “very”. Clearly the end of the phrase (if available) conveys
significant information about which word to pick. A sequence model that is incapable of taking
advantage of this will perform poorly on related tasks. For instance, to do well in named entity
recognition (e.g., to recognize whether “Green” refers to “Mr. Green” or to the color) longer-range
context is equally vital. To get some inspiration for addressing the problem let us take a detour to
probabilistic graphical models.


9.4.1 Dynamic Programming in Hidden Markov Models 


This subsection serves to illustrate the dynamic programming problem. The specific technical 
details do not matter for understanding the deep learning models but they help in motivating why 
one might use deep learning and why one might pick specific architectures. 


If we want to solve the problem using probabilistic graphical models we could for instance design
a latent variable model as follows. At any time step $t$, we assume that there exists some latent
variable $h_t$ that governs our observed emission $x_t$ via $P(x_t \mid h_t)$. Moreover, any transition
$h_t \to h_{t+1}$ is given by some state transition probability $P(h_{t+1} \mid h_t)$. This probabilistic
graphical model is then a hidden Markov model as in Fig. 9.4.1.





Fig. 9.4.1: A hidden Markov model. 


Thus, for a sequence of $T$ observations we have the following joint probability distribution over
the observed and hidden states:


$$P(x_1, \ldots, x_T, h_1, \ldots, h_T) = \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t), \text{ where } P(h_1 \mid h_0) = P(h_1). \tag{9.4.1}$$


Now assume that we observe all $x_i$ with the exception of some $x_j$ and it is our goal to compute
$P(x_j \mid x_{-j})$, where $x_{-j} = (x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_T)$. Since there is no latent
variable in $P(x_j \mid x_{-j})$, we consider summing over all the possible combinations of choices for
$h_1, \ldots, h_T$. In case any $h_i$ can take on $k$ distinct values (a finite number of states), this
means that we need to sum over $k^T$ terms, which is usually mission impossible! Fortunately there is
an elegant solution for this: dynamic programming.
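To see how it works, consider summing over latent variables $h_1, \ldots, h_T$ in turn. According to
(9.4.1), this yields:


$$\begin{aligned}
&P(x_1, \ldots, x_T) \\
=& \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\
=& \sum_{h_1, \ldots, h_T} \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \sum_{h_2, \ldots, h_T} \underbrace{\left[\sum_{h_1} P(h_1) P(x_1 \mid h_1) P(h_2 \mid h_1)\right]}_{\pi_2(h_2) \stackrel{\mathrm{def}}{=}} P(x_2 \mid h_2) \prod_{t=3}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \sum_{h_3, \ldots, h_T} \underbrace{\left[\sum_{h_2} \pi_2(h_2) P(x_2 \mid h_2) P(h_3 \mid h_2)\right]}_{\pi_3(h_3) \stackrel{\mathrm{def}}{=}} P(x_3 \mid h_3) \prod_{t=4}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \dots \\
=& \sum_{h_T} \pi_T(h_T) P(x_T \mid h_T).
\end{aligned} \tag{9.4.2}$$


In general we have the forward recursion as


$$\pi_{t+1}(h_{t+1}) = \sum_{h_t} \pi_t(h_t) P(x_t \mid h_t) P(h_{t+1} \mid h_t). \tag{9.4.3}$$


The recursion is initialized as $\pi_1(h_1) = P(h_1)$. In abstract terms this can be written as
$\pi_{t+1} = f(\pi_t, x_t)$, where $f$ is some learnable function. This looks very much like the update
equation in the latent variable models we discussed so far in the context of RNNs!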


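Entirely analogously to the forward recursion, we can also sum over the same set of latent variables
with a backward recursion. This yields:


$$\begin{aligned}
&P(x_1, \ldots, x_T) \\
=& \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\
=& \sum_{h_1, \ldots, h_T} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot P(h_T \mid h_{T-1}) P(x_T \mid h_T) \\
=& \sum_{h_1, \ldots, h_{T-1}} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_T} P(h_T \mid h_{T-1}) P(x_T \mid h_T)\right]}_{\rho_{T-1}(h_{T-1}) \stackrel{\mathrm{def}}{=}} \\
=& \sum_{h_1, \ldots, h_{T-2}} \prod_{t=1}^{T-2} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_{T-1}} P(h_{T-1} \mid h_{T-2}) P(x_{T-1} \mid h_{T-1}) \rho_{T-1}(h_{T-1})\right]}_{\rho_{T-2}(h_{T-2}) \stackrel{\mathrm{def}}{=}} \\
=& \dots \\
=& \sum_{h_1} P(h_1) P(x_1 \mid h_1) \rho_1(h_1).
\end{aligned} \tag{9.4.4}$$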
We can thus write the backward recursion as


$$\rho_{t-1}(h_{t-1}) = \sum_{h_t} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \rho_t(h_t), \tag{9.4.5}$$


with initialization $\rho_T(h_T) = 1$. Both the forward and backward recursions allow us to sum over
$T$ latent variables in $\mathcal{O}(kT)$ (linear) time over all values of $(h_1, \ldots, h_T)$ rather than
in exponential time. This is one of the great benefits of the probabilistic inference with graphical
models. It is also a very special instance of a general message passing algorithm (Aji & McEliece,
2000). Combining both forward and backward recursions, we are able to compute


$$P(x_j \mid x_{-j}) \propto \sum_{h_j} \pi_j(h_j) \rho_j(h_j) P(x_j \mid h_j). \tag{9.4.6}$$


Note that in abstract terms the backward recursion can be written as $\rho_{t-1} = g(\rho_t, x_t)$, where
$g$ is a learnable function. Again, this looks very much like an update equation, just running
backwards unlike what we have seen so far in RNNs. Indeed, hidden Markov models benefit from knowing
future data when it is available. Signal processing scientists distinguish between the two cases of
knowing and not knowing future observations as interpolation vs. extrapolation. See the introductory
chapter of the book on sequential Monte Carlo algorithms for more details (Doucet et al., 2001).
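
The forward and backward recursions can be checked numerically on a toy model. The following is a minimal sketch in plain Python and not part of the book's code: the function names (`forward_messages`, `backward_messages`, `likelihood`, `brute_force`, `fill_in`) and the probability tables are assumptions made purely for illustration. It computes the messages $\pi_t$ and $\rho_t$ as in (9.4.3) and (9.4.5), compares the resulting likelihood with the explicit sum over all $k^T$ hidden sequences, and evaluates (9.4.6) for one missing observation.

```python
from itertools import product


def forward_messages(prior, trans, emit, xs):
    """Return [pi_1, ..., pi_T] following (9.4.3), with pi_1(h) = P(h_1 = h)."""
    k = len(prior)
    pis = [list(prior)]
    for x in xs[:-1]:
        prev = pis[-1]
        pis.append([sum(prev[i] * emit[i][x] * trans[i][j] for i in range(k))
                    for j in range(k)])
    return pis


def backward_messages(trans, emit, xs):
    """Return [rho_1, ..., rho_T] following (9.4.5), with rho_T(h) = 1."""
    k = len(trans)
    rhos = [[1.0] * k]
    for x in reversed(xs[1:]):
        nxt = rhos[0]
        rhos.insert(0, [sum(trans[i][j] * emit[j][x] * nxt[j] for j in range(k))
                        for i in range(k)])
    return rhos


def likelihood(prior, trans, emit, xs):
    """P(x_1, ..., x_T) via the forward recursion (time linear in T)."""
    pi_T = forward_messages(prior, trans, emit, xs)[-1]
    return sum(pi_T[h] * emit[h][xs[-1]] for h in range(len(prior)))


def brute_force(prior, trans, emit, xs):
    """P(x_1, ..., x_T) by summing over all k^T hidden sequences."""
    total = 0.0
    for hs in product(range(len(prior)), repeat=len(xs)):
        p = prior[hs[0]] * emit[hs[0]][xs[0]]
        for t in range(1, len(xs)):
            p *= trans[hs[t - 1]][hs[t]] * emit[hs[t]][xs[t]]
        total += p
    return total


def fill_in(prior, trans, emit, xs, j):
    """P(x_j | x_{-j}) as in (9.4.6); j is 0-based and xs[j] is ignored."""
    pi_j = forward_messages(prior, trans, emit, xs)[j]
    rho_j = backward_messages(trans, emit, xs)[j]
    scores = [sum(pi_j[h] * rho_j[h] * emit[h][v] for h in range(len(prior)))
              for v in range(len(emit[0]))]
    z = sum(scores)
    return [s / z for s in scores]


# Toy model (assumed values): k = 2 hidden states, 3 possible observations.
prior = [0.6, 0.4]                          # P(h_1)
trans = [[0.7, 0.3], [0.2, 0.8]]            # trans[i][j] = P(h_{t+1}=j | h_t=i)
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]   # emit[h][x] = P(x_t=x | h_t=h)
xs = [0, 2, 1, 1]

print(likelihood(prior, trans, emit, xs), brute_force(prior, trans, emit, xs))
print(fill_in(prior, trans, emit, xs, j=1))
```

The two printed likelihoods should agree up to floating-point error. Note that `fill_in` never reads `xs[j]`: the forward message at position $j$ only depends on earlier observations and the backward message only on later ones, which is exactly why both directions are needed to fill in a blank.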


9.4.2 Bidirectional Model 


If we want to have a mechanism in RNNs that offers comparable look-ahead ability as in hidden 
Markov models, we need to modify the RNN design that we have seen so far. Fortunately, this is 
easy conceptually. Instead of running an RNN only in the forward mode starting from the first 
token, we start another one from the last token running from back to front. Bidirectional RNNs 
add a hidden layer that passes information in a backward direction to more flexibly process such 
information. Fig. 9.4.2 illustrates the architecture of a bidirectional RNN with a single hidden 
layer. 





Fig. 9.4.2: Architecture of a bidirectional RNN. 


In fact, this is not too dissimilar to the forward and backward recursions in the dynamic
programming of hidden Markov models. The main distinction is that in the previous case these
equations had a specific statistical meaning. Now they are devoid of such easily accessible
interpretations and we can just treat them as generic and learnable functions. This transition
epitomizes many of the principles guiding the design of modern deep networks: first, use the type of
functional dependencies of classical statistical models, and then parameterize them in a generic form.


Definition


Bidirectional RNNs were introduced by Schuster & Paliwal (1997). For a detailed discussion of the
various architectures see also the paper by Graves & Schmidhuber (2005). Let us look at the specifics
of such a network.


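For any time step $t$, given a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of
examples: $n$, number of inputs in each example: $d$), let the hidden layer activation function be
$\phi$. In the bidirectional architecture, we assume that the forward and backward hidden states for
this time step are $\overrightarrow{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$ and
$\overleftarrow{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$, respectively, where $h$ is the number of
hidden units. The forward and backward hidden state updates are as follows:


$$\begin{aligned}
\overrightarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(f)} + \overrightarrow{\mathbf{H}}_{t-1} \mathbf{W}_{hh}^{(f)} + \mathbf{b}_h^{(f)}), \\
\overleftarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(b)} + \overleftarrow{\mathbf{H}}_{t+1} \mathbf{W}_{hh}^{(b)} + \mathbf{b}_h^{(b)}),
\end{aligned} \tag{9.4.7}$$


where the weights $\mathbf{W}_{xh}^{(f)} \in \mathbb{R}^{d \times h}$, $\mathbf{W}_{hh}^{(f)} \in \mathbb{R}^{h \times h}$,
$\mathbf{W}_{xh}^{(b)} \in \mathbb{R}^{d \times h}$, and $\mathbf{W}_{hh}^{(b)} \in \mathbb{R}^{h \times h}$, and biases
$\mathbf{b}_h^{(f)} \in \mathbb{R}^{1 \times h}$ and $\mathbf{b}_h^{(b)} \in \mathbb{R}^{1 \times h}$ are all the model parameters.


Next, we concatenate the forward and backward hidden states $\overrightarrow{\mathbf{H}}_t$ and
$\overleftarrow{\mathbf{H}}_t$ to obtain the hidden state $\mathbf{H}_t \in \mathbb{R}^{n \times 2h}$ to be fed
into the output layer. In deep bidirectional RNNs with multiple hidden layers, such information is
passed on as input to the next bidirectional layer. Last, the output layer computes the output
$\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$):


$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q. \tag{9.4.8}$$


Here, the weight matrix $\mathbf{W}_{hq} \in \mathbb{R}^{2h \times q}$ and the bias
$\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer. In fact, the two
directions can have different numbers of hidden units.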
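
The two hidden state updates in (9.4.7) can be traced by hand on a tiny example. The sketch below is an illustration in plain Python under several assumptions that are not taken from the book: a single example ($n = 1$), $\phi = \tanh$, zero initial states, randomly drawn weights, and the hypothetical helper names `rnn_step`, `init_params`, and `bidirectional_states`. It runs one recurrence from left to right and one from right to left, then concatenates the two states at each time step.

```python
import math
import random


def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """phi(x W_xh + h_prev W_hh + b_h) with phi = tanh, for one example."""
    return [math.tanh(sum(x[i] * W_xh[i][j] for i in range(len(x)))
                      + sum(h_prev[i] * W_hh[i][j] for i in range(len(h_prev)))
                      + b_h[j])
            for j in range(len(b_h))]


def init_params(d, h, rng):
    """Random (W_xh, W_hh, b_h) with shapes d x h, h x h, and h."""
    W_xh = [[rng.uniform(-0.5, 0.5) for _ in range(h)] for _ in range(d)]
    W_hh = [[rng.uniform(-0.5, 0.5) for _ in range(h)] for _ in range(h)]
    return W_xh, W_hh, [0.0] * h


def bidirectional_states(xs, fwd_params, bwd_params):
    """Return H_t for every t: forward state concatenated with backward state."""
    T = len(xs)
    fwd, state = [], [0.0] * len(fwd_params[2])
    for t in range(T):                       # left to right, uses the state at t-1
        state = rnn_step(xs[t], state, *fwd_params)
        fwd.append(state)
    bwd, state = [None] * T, [0.0] * len(bwd_params[2])
    for t in reversed(range(T)):             # right to left, uses the state at t+1
        state = rnn_step(xs[t], state, *bwd_params)
        bwd[t] = state
    return [f + b for f, b in zip(fwd, bwd)]


rng = random.Random(0)
d, h = 2, 3
fwd_params, bwd_params = init_params(d, h, rng), init_params(d, h, rng)
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H = bidirectional_states(xs, fwd_params, bwd_params)
print(len(H), len(H[0]))  # number of time steps, and 2h entries per state

# Change only the last input and compare the state at the first time step.
H_alt = bidirectional_states(xs[:-1] + [[-1.0, 0.5]], fwd_params, bwd_params)
print(H[0][:h] == H_alt[0][:h], H[0][h:] == H_alt[0][h:])
```

Changing only the last input leaves the forward half of the first hidden state untouched but alters its backward half, which is the look-ahead behavior that the bidirectional design is meant to provide.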


Computational Cost and Applications 


One of the key features of a bidirectional RNN is that information from both ends of the sequence 
is used to estimate the output. That is, we use information from both future and past observations 
to predict the current one. In the case of next token prediction this is not quite what we want. 
After all, we do not have the luxury of knowing the next to next token when predicting the next 
one. Hence, if we were to use a bidirectional RNN naively we would not get a very good accuracy: 
during training we have past and future data to estimate the present. During test time we only 
have past data and thus poor accuracy. We will illustrate this in an experiment below. 


To add insult to injury, bidirectional RNNs are also exceedingly slow. The main reasons for this 
are that the forward propagation requires both forward and backward recursions in bidirectional 
layers and that the backpropagation is dependent on the outcomes of the forward propagation. 
Hence, gradients will have a very long dependency chain. 


In practice bidirectional layers are used very sparingly and only for a narrow set of applications, 
such as filling in missing words, annotating tokens (e.g., for named entity recognition), and en- 
coding sequences wholesale as a step in a sequence processing pipeline (e.g., for machine transla- 
tion). In Section 14.8 and Section 15.2, we will introduce how to use bidirectional RNNs to encode 
text sequences. 
9.4.3 Training a Bidirectional RNN for a Wrong Application


If we were to ignore all advice regarding the fact that bidirectional RNNs use past and future data
and simply apply them to language models, we will get estimates with acceptable perplexity.
Nonetheless, the ability of the model to predict future tokens is severely compromised as the
experiment below illustrates. Despite reasonable perplexity, it only generates gibberish even after
many iterations. We include the code below as a cautionary example against using them in the wrong
context.


from d2l import mxnet as d2l
from mxnet import npx
from mxnet.gluon import rnn

npx.set_np()

# Load data
batch_size, num_steps, device = 32, 35, d2l.try_gpu()
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
# Define the bidirectional LSTM model by setting `bidirectional=True`
vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
lstm_layer = rnn.LSTM(num_hiddens, num_layers, bidirectional=True)
model = d2l.RNNModel(lstm_layer, len(vocab))
# Train the model
num_epochs, lr = 500, 1
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)


perplexity 1.2, 72320.0 tokens/sec on gpu(0) 
time travellerererererererererererererererererererererererererer 
travellerererererererererererererererererererererererererer 


[Figure: training perplexity versus epoch over 500 epochs.]


The output is clearly unsatisfactory for the reasons described above. For a discussion of more 
effective uses of bidirectional RNNs, please see the sentiment analysis application in Section 15.2. 
Summary


• In bidirectional RNNs, the hidden state for each time step is simultaneously determined by
the data prior to and after the current time step.


• Bidirectional RNNs bear a striking resemblance to the forward-backward algorithm in
probabilistic graphical models.


• Bidirectional RNNs are mostly useful for sequence encoding and the estimation of observations
given bidirectional context.


• Bidirectional RNNs are very costly to train due to long gradient chains.


Exercises 


1. If the different directions use a different number of hidden units, how will the shape of
$\mathbf{H}_t$ change?


2. Design a bidirectional RNN with multiple hidden layers. 


3. Polysemy is common in natural languages. For example, the word “bank” has different 
meanings in contexts “i went to the bank to deposit cash” and “i went to the bank to sit down”. 
How can we design a neural network model such that given a context sequence and a word, 
a vector representation of the word in the context will be returned? What type of neural 
architectures is preferred for handling polysemy? 


Discussions: https://discuss.d2l.ai/t/339


9.5 Machine Translation and the Dataset


We have used RNNs to design language models, which are key to natural language processing.
Another flagship benchmark is machine translation, a central problem domain for sequence
transduction models that transform input sequences into output sequences. Playing a crucial role in
various modern AI applications, sequence transduction models will form the focus of the remainder
of this chapter and Chapter 10. To this end, this section introduces the machine translation
problem and its dataset that will be used later.


Machine translation refers to the automatic translation of a sequence from one language to another.
In fact, this field may date back to the 1940s, soon after digital computers were invented,
especially by considering the use of computers for cracking language codes in World War II. For
decades, statistical approaches had been dominant in this field (Brown et al., 1988, 1990) before the
rise of end-to-end learning using neural networks. The latter is often called neural machine
translation to distinguish itself from statistical machine translation that involves statistical
analysis in components such as the translation model and the language model.


Emphasizing end-to-end learning, this book will focus on neural machine translation methods.
Different from our language model problem in Section 8.3 whose corpus is in one single language,
machine translation datasets are composed of pairs of text sequences that are in the source
language and the target language, respectively. Thus, instead of reusing the preprocessing routine
for language modeling, we need a different way to preprocess machine translation datasets. In
the following, we show how to load the preprocessed data into minibatches for training.


from d2l import mxnet as d2l
from mxnet import np, npx
import os

npx.set_np()


9.5.1 Downloading and Preprocessing the Dataset 


To begin with, we download an English-French dataset that consists of bilingual sentence pairs
from the Tatoeba Project (http://www.manythings.org/anki/). Each line in the dataset is a
tab-delimited pair of an English text sequence and the translated French text sequence. Note that
each text sequence can be just one sentence or a paragraph of multiple sentences. In this machine
translation problem where English is translated into French, English is the source language and
French is the target language.


#@save
d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
                           '94646ad1522d915e7b0f9296181140edcf8624f5')

#@save
def read_data_nmt():
    """Load the English-French dataset."""
    data_dir = d2l.download_extract('fra-eng')
    with open(os.path.join(data_dir, 'fra.txt'), 'r') as f:
        return f.read()

raw_text = read_data_nmt()
print(raw_text[:75])


Go.	Va !
Hi.	Salut !
Run!	Cours !
Run!	Courez !
Who?	Qui ?
Wow!	Ça alors !


After downloading the dataset, we proceed with several preprocessing steps for the raw text data.
For instance, we replace non-breaking space with space, convert uppercase letters to lowercase
ones, and insert space between words and punctuation marks.


#@save
def preprocess_nmt(text):
    """Preprocess the English-French dataset."""
    def no_space(char, prev_char):
        return char in set(',.!?') and prev_char != ' '

    # Replace non-breaking space with space, and convert uppercase letters to
    # lowercase ones
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    # Insert space between words and punctuation marks
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text)]
    return ''.join(out)

text = preprocess_nmt(raw_text)
print(text[:80])


go .	va !
hi .	salut !
run !	cours !
run !	courez !
who ?	qui ?
wow !	ça alors !


9.5.2 Tokenization 




Different from character-level tokenization in Section 8.3, for machine translation we prefer
word-level tokenization here (state-of-the-art models may use more advanced tokenization techniques).
The following tokenize_nmt function tokenizes the first num_examples text sequence pairs,
where each token is either a word or a punctuation mark. This function returns two lists of
token lists: source and target. Specifically, source[i] is a list of tokens from the $i^{\mathrm{th}}$
text sequence in the source language (English here) and target[i] is that in the target language
(French here).


#@save
def tokenize_nmt(text, num_examples=None):
    """Tokenize the English-French dataset."""
    source, target = [], []
    for i, line in enumerate(text.split('\n')):
        if num_examples and i > num_examples:
            break
        parts = line.split('\t')
        if len(parts) == 2:
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))
    return source, target

source, target = tokenize_nmt(text)
source[:6], target[:6]


([['go', '.'],
  ['hi', '.'],
  ['run', '!'],
  ['run', '!'],
  ['who', '?'],
  ['wow', '!']],
 [['va', '!'],
  ['salut', '!'],
  ['cours', '!'],
  ['courez', '!'],
  ['qui', '?'],
  ['ça', 'alors', '!']])
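
To try the preprocessing and tokenization steps without downloading the dataset, the sketch below restates the same logic in plain Python and applies it to a three-line sample in the same tab-delimited format. The sample string and the names `preprocess`, `tokenize`, `toy_source`, and `toy_target` are assumptions made for this illustration; the functions mirror, but are not, the saved `preprocess_nmt` and `tokenize_nmt` helpers.

```python
def preprocess(text):
    """Normalize spaces, lowercase, and put a space before , . ! ?"""
    def no_space(char, prev_char):
        return char in set(',.!?') and prev_char != ' '

    # Narrow and regular non-breaking spaces become ordinary spaces
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text)]
    return ''.join(out)


def tokenize(text):
    """Split each tab-delimited line into source and target token lists."""
    source, target = [], []
    for line in text.split('\n'):
        parts = line.split('\t')
        if len(parts) == 2:  # skip empty or malformed lines
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))
    return source, target


# Hypothetical sample in the same format as the dataset file.
sample = 'Go.\tVa\u202f!\nHi.\tSalut\u202f!\nWow!\t\u00c7a alors\u202f!\n'
toy_source, toy_target = tokenize(preprocess(sample))
print(toy_source)
print(toy_target)
```

Because a space is inserted before each punctuation mark, a plain split on spaces is enough to separate words from punctuation at the token level.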


Let us plot the histogram of the number of tokens per text sequence. In this simple English-French
dataset, most of the text sequences have fewer than 20 tokens.


d2l.set_figsize()
_, _, patches = d2l.plt.hist(
    [[len(l) for l in source], [len(l) for l in target]],
    label=['source', 'target'])
for patch in patches[1].patches:
    patch.set_hatch('/')
d2l.plt.legend(loc='upper right');


[Figure: histogram of the number of tokens per text sequence, for source and target.]





9.5.3 Vocabulary 


Since the machine translation dataset consists of pairs of languages, we can build two vocabularies
for both the source language and the target language separately. With word-level tokenization,
the vocabulary size will be significantly larger than that using character-level tokenization. To
alleviate this, here we treat infrequent tokens that appear fewer than 2 times as the same unknown
(“<unk>”) token. Besides that, we specify additional special tokens such as for padding (“<pad>”)
sequences to the same length in minibatches, and for marking the beginning (“<bos>”) or end
(“<eos>”) of sequences. Such special tokens are commonly used in natural language processing
tasks.


src_vocab = d2l.Vocab(source, min_freq=2,
                      reserved_tokens=['<pad>', '<bos>', '<eos>'])
len(src_vocab)


10012 
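
The effect of the frequency threshold and the special tokens can be illustrated with a small stand-in for a vocabulary. The helper below is a hypothetical sketch: `build_vocab` and `encode` are not functions of the d2l package, and the index layout (“<unk>” at index 0, followed by the reserved tokens, then the remaining tokens by decreasing frequency) is an assumption of this sketch.

```python
import collections


def build_vocab(token_lists, min_freq=2,
                reserved_tokens=('<pad>', '<bos>', '<eos>')):
    """Map tokens to indices; tokens seen fewer than min_freq times are dropped."""
    counter = collections.Counter(tok for line in token_lists for tok in line)
    idx_to_token = ['<unk>'] + list(reserved_tokens)
    # Most frequent first; ties are broken alphabetically to stay deterministic
    for tok, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in idx_to_token:
            idx_to_token.append(tok)
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx


def encode(tokens, token_to_idx):
    """Replace each token by its index, falling back to the '<unk>' index."""
    return [token_to_idx.get(tok, token_to_idx['<unk>']) for tok in tokens]


toy = [['run', '!'], ['run', '!'], ['who', '?'], ['wow', '!']]
idx_to_token, token_to_idx = build_vocab(toy, min_freq=2)
print(idx_to_token)
print(encode(['who', '!'], token_to_idx))
```

In this toy corpus only “!” and “run” occur at least twice, so “who”, “wow”, and “?” all map to the index of “<unk>”. Raising `min_freq` shrinks the vocabulary at the cost of mapping more words to the unknown token.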
9.5.4 Loading the Dataset


Recall that in language modeling each sequence example, either a segment of one sentence or a 
span over multiple sentences, has a fixed length. This was specified by the num_steps (number of 
time steps or tokens) argument in Section 8.3. In machine translation, each example is a pair of 
source and target text sequences, where each text sequence may have different lengths. 


For computational efficiency, we can still process a minibatch of text sequences at one time by
truncation and padding. Suppose that every sequence in the same minibatch should have the same
length num_steps. If a text sequence has fewer than num_steps tokens, we will keep appending the
special “<pad>” token to its end until its length reaches num_steps. Otherwise, we will truncate
the text sequence by only taking its first num_steps tokens and discarding the rest. In this
way, every text sequence will have the same length to be loaded in minibatches of the same shape.


The following truncate_pad function truncates or pads text sequences as described before. 


#@save
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad sequences."""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad

truncate_pad(src_vocab[source[0]], 10, src_vocab['<pad>'])


[Output not recoverable: a list of 10 token indices, the indices of 'go' and '.' followed by eight copies of the '<pad>' index.]


Now we define a function to transform text sequences into minibatches for training. We append 
the special “<eos>” token to the end of every sequence to indicate the end of the sequence. When 
a model is predicting by generating a sequence token after token, the generation of the “<eos>” 
token can suggest that the output sequence is complete. Besides, we also record the length of each 
text sequence excluding the padding tokens. This information will be needed by some models that 
we will cover later. 


#@save
def build_array_nmt(lines, vocab, num_steps):
    """Transform text sequences of machine translation into minibatches."""
    lines = [vocab[l] for l in lines]
    lines = [l + [vocab['<eos>']] for l in lines]
    array = np.array(
        [truncate_pad(l, num_steps, vocab['<pad>']) for l in lines])
    valid_len = (array != vocab['<pad>']).astype(np.int32).sum(1)
    return array, valid_len
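
The interplay of the “<eos>” token, truncation, padding, and the valid length can be followed on plain Python lists. The sketch below is framework-free and only illustrative: the helper `build_rows` and the concrete token indices (1 for “<pad>”, 3 for “<eos>”) are assumptions, whereas `truncate_pad` repeats the function defined above.

```python
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a list of token indices to length num_steps."""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad


def build_rows(lines, eos, pad, num_steps):
    """Append eos, truncate or pad, and count non-padding tokens per row."""
    rows = [truncate_pad(line + [eos], num_steps, pad) for line in lines]
    valid_len = [sum(tok != pad for tok in row) for row in rows]
    return rows, valid_len


# Assumed indices for this sketch: 1 is '<pad>', 3 is '<eos>'.
PAD, EOS = 1, 3
lines = [[9, 4], [7, 8, 6, 5, 4]]
rows, valid_len = build_rows(lines, EOS, PAD, num_steps=4)
print(rows)
print(valid_len)
```

The short sequence keeps its “<eos>” token and receives one padding token, so its valid length is 3. The long sequence is cut to four tokens, and since “<eos>” is appended before truncation it is among the tokens that are cut off.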







9.5.5 Putting All Things Together 


Finally, we define the load_data_nmt function to return the data iterator, together with the vocab- 
ularies for both the source language and the target language. 


#@save
def load_data_nmt(batch_size, num_steps, num_examples=600):
    """Return the iterator and the vocabularies of the translation dataset."""
    text = preprocess_nmt(read_data_nmt())
    source, target = tokenize_nmt(text, num_examples)
    src_vocab = d2l.Vocab(source, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    tgt_vocab = d2l.Vocab(target, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
    data_iter = d2l.load_array(data_arrays, batch_size)
    return data_iter, src_vocab, tgt_vocab


Let us read the first minibatch from the English-French dataset.


train_iter, src_vocab, tgt_vocab = load_data_nmt(batch_size=2, num_steps=8)
for X, X_valid_len, Y, Y_valid_len in train_iter:
    print('X:', X.astype(np.int32))
    print('valid lengths for X:', X_valid_len)
    print('Y:', Y.astype(np.int32))
    print('valid lengths for Y:', Y_valid_len)
    break


X: [a 2 x 8 array of source token indices; values not recoverable]
valid lengths for X: [4 5]
Y: [a 2 x 8 array of target token indices; values not recoverable]
valid lengths for Y: [5 5]


Summary 


• Machine translation refers to the automatic translation of a sequence from one language to
another.


• Using word-level tokenization, the vocabulary size will be significantly larger than that using
character-level tokenization. To alleviate this, we can treat infrequent tokens as the same
unknown token.


• We can truncate and pad text sequences so that all of them will have the same length to be
loaded in minibatches.
Exercises


1. Try different values of the num_examples argument in the load_data_nmt function. How does 
this affect the vocabulary sizes of the source language and the target language? 


2. Text in some languages such as Chinese and Japanese does not have word boundary indi- 
cators (e.g., space). Is word-level tokenization still a good idea for such cases? Why or why 
not? 


Discussions: https://discuss.d2l.ai/t/344


9.6 Encoder-Decoder Architecture


As we have discussed in Section 9.5, machine translation is a major problem domain for sequence
transduction models, whose input and output are both variable-length sequences. To handle this
type of inputs and outputs, we can design an architecture with two major components. The first
component is an encoder: it takes a variable-length sequence as the input and transforms it into
a state with a fixed shape. The second component is a decoder: it maps the encoded state of a
fixed shape to a variable-length sequence. This is called an encoder-decoder architecture, which is
depicted in Fig. 9.6.1.


Fig. 9.6.1: The encoder-decoder architecture.


Let us take machine translation from English to French as an example. Given an input sequence
in English: “They”, “are”, “watching”, “.”, this encoder-decoder architecture first encodes the
variable-length input into a state, then decodes the state to generate the translated sequence token
by token as the output: “Ils”, “regardent”, “.”. Since the encoder-decoder architecture forms the
basis of different sequence transduction models in subsequent sections, this section will convert
this architecture into an interface that will be implemented later.


9.6.1 Encoder


In the encoder interface, we just specify that the encoder takes variable-length sequences as the
input X. The implementation will be provided by any model that inherits this base Encoder class.


from mxnet.gluon import nn

#@save
class Encoder(nn.Block):
    """The base encoder interface for the encoder-decoder architecture."""
    def __init__(self, **kwargs):
        super(Encoder, self).__init__(**kwargs)

    def forward(self, X, *args):
        raise NotImplementedError
9.6.2 Decoder


In the following decoder interface, we add an additional init_state function to convert the
encoder output (enc_outputs) into the encoded state. Note that this step may need extra inputs such
as the valid length of the input, which was explained in Section 9.5.4. To generate a variable-length
sequence token by token, at every time step the decoder maps an input (e.g., the generated token at
the previous time step) and the encoded state into an output token at the current time step.


#@save
class Decoder(nn.Block):
    """The base decoder interface for the encoder-decoder architecture."""
    def __init__(self, **kwargs):
        super(Decoder, self).__init__(**kwargs)

    def init_state(self, enc_outputs, *args):
        raise NotImplementedError

    def forward(self, X, state):
        raise NotImplementedError


9.6.3 Putting the Encoder and Decoder Together 


In the end, the encoder-decoder architecture contains both an encoder and a decoder, optionally
with extra arguments. In the forward propagation, the output of the encoder is used to produce
the encoded state, and this state will be further used by the decoder as one of its inputs.


#@save
class EncoderDecoder(nn.Block):
    """The base class for the encoder-decoder architecture."""
    def __init__(self, encoder, decoder, **kwargs):
        super(EncoderDecoder, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)


The term “state” in the encoder-decoder architecture has probably inspired you to implement this 
architecture using neural networks with states. In the next section, we will see how to apply RNNs 
to design sequence transduction models based on this encoder-decoder architecture. 
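
Before moving to RNNs, the data flow through the three interfaces can be traced with a deliberately trivial instance. The sketch below is framework-free (plain Python classes instead of nn.Block), and the concrete `MeanEncoder` and `ShiftDecoder` classes are hypothetical toys invented for this illustration; they only show which object produces the state and which one consumes it.

```python
class Encoder:
    """Framework-free stand-in for the base encoder interface."""
    def forward(self, X, *args):
        raise NotImplementedError

    def __call__(self, X, *args):
        return self.forward(X, *args)


class Decoder:
    """Framework-free stand-in for the base decoder interface."""
    def init_state(self, enc_outputs, *args):
        raise NotImplementedError

    def forward(self, X, state):
        raise NotImplementedError

    def __call__(self, X, state):
        return self.forward(X, state)


class EncoderDecoder:
    """Chain an encoder and a decoder as in the forward method above."""
    def __init__(self, encoder, decoder):
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)


class MeanEncoder(Encoder):
    """Toy encoder: summarize each variable-length sequence by its mean."""
    def forward(self, X, *args):
        return [sum(seq) / len(seq) for seq in X]


class ShiftDecoder(Decoder):
    """Toy decoder: add the encoded state to every decoder input."""
    def init_state(self, enc_outputs, *args):
        return enc_outputs

    def forward(self, X, state):
        outputs = [[x + s for x in seq] for seq, s in zip(X, state)]
        return outputs, state


model = EncoderDecoder(MeanEncoder(), ShiftDecoder())
enc_X = [[1.0, 2.0, 3.0], [4.0]]            # two source sequences
dec_X = [[0.0, 1.0], [10.0, 20.0, 30.0]]    # two decoder input sequences
outputs, state = model.forward(enc_X, dec_X)
print(outputs, state)
```

Each source sequence, whatever its length, is reduced to a state of fixed shape (here a single number), and the decoder produces one output per decoder input while carrying that state along.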
Summary


• The encoder-decoder architecture can handle inputs and outputs that are both variable-length
sequences, and is thus suitable for sequence transduction problems such as machine
translation.


• The encoder takes a variable-length sequence as the input and transforms it into a state with
a fixed shape.


• The decoder maps the encoded state of a fixed shape to a variable-length sequence.


Exercises 


1. Suppose that we use neural networks to implement the encoder-decoder architecture. Do 
the encoder and the decoder have to be the same type of neural network? 


2. Besides machine translation, can you think of another application where the encoder- 
decoder architecture can be applied? 


Discussions: https://discuss.d2l.ai/t/341


9.7 Sequence to Sequence Learning


As we have seen in Section 9.5, in machine translation both the input and output are
variable-length sequences. To address this type of problem, we have designed a general encoder-decoder
architecture in Section 9.6. In this section, we will use two RNNs to design the encoder and the
decoder of this architecture and apply it to sequence to sequence learning for machine translation
(Sutskever et al., 2014; Cho et al., 2014b).


Following the design principle of the encoder-decoder architecture, the RNN encoder can take a
variable-length sequence as the input and transform it into a fixed-shape hidden state. In other
words, information of the input (source) sequence is encoded in the hidden state of the RNN
encoder. To generate the output sequence token by token, a separate RNN decoder can predict the
next token based on what tokens have been seen (such as in language modeling) or generated,
together with the encoded information of the input sequence. Fig. 9.7.1 illustrates how to use two
RNNs for sequence to sequence learning in machine translation.


[Figure: an Encoder panel and a Decoder panel.]


Fig. 9.7.1: Sequence to sequence learning with an RNN encoder and an RNN decoder.
In Fig. 9.7.1, the special “<eos>” token marks the end of the sequence. The model can stop making
predictions once this token is generated. At the initial time step of the RNN decoder, there are two
special design decisions. First, the special beginning-of-sequence “<bos>” token is an input.
Second, the final hidden state of the RNN encoder is used to initiate the hidden state of the decoder.
In designs such as (Sutskever et al., 2014), this is exactly how the encoded input sequence
information is fed into the decoder for generating the output (target) sequence. In some other designs
such as (Cho et al., 2014b), the final hidden state of the encoder is also fed into the decoder as part
of the inputs at every time step as shown in Fig. 9.7.1. Similar to the training of language models
in Section 8.3, we can allow the labels to be the original output sequence, shifted by one token:


“<bos>”, “Ils”, “regardent”, “.” → “Ils”, “regardent”, “.”, “<eos>”.


In the following, we will explain the design of Fig. 9.7.1 in greater detail. We will train this model 
for machine translation on the English-French dataset as introduced in Section 9.5. 
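
The one-token shift between decoder inputs and labels can be written down in a few lines. The helper name `decoder_inputs_and_labels` is a hypothetical choice for this sketch, which works on token strings rather than on the index arrays used for training.

```python
def decoder_inputs_and_labels(target_tokens, bos='<bos>', eos='<eos>'):
    """Return (decoder inputs, labels): the same sequence offset by one token."""
    labels = list(target_tokens) + [eos]
    inputs = [bos] + labels[:-1]
    return inputs, labels


inputs, labels = decoder_inputs_and_labels(['Ils', 'regardent', '.'])
for step, (x, y) in enumerate(zip(inputs, labels), start=1):
    print(step, x, '->', y)
```

At every time step the decoder is fed the token that precedes the one it is asked to predict, so the first input is “<bos>” and the last label is “<eos>”.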


import collections
from d2l import mxnet as d2l
import math
from mxnet import np, npx, init, gluon, autograd
from mxnet.gluon import nn, rnn

npx.set_np()


9.7.1 Encoder 


Technically speaking, the encoder transforms an input sequence of variable length into a
fixed-shape context variable $\mathbf{c}$, and encodes the input sequence information in this context
variable. As depicted in Fig. 9.7.1, we can use an RNN to design the encoder.


Let us consider a sequence example (batch size: 1). Suppose that the input sequence is
$x_1, \ldots, x_T$, such that $x_t$ is the $t^{\mathrm{th}}$ token in the input text sequence. At time step
$t$, the RNN transforms the input feature vector $\mathbf{x}_t$ for $x_t$ and the hidden state
$\mathbf{h}_{t-1}$ from the previous time step into the current hidden state $\mathbf{h}_t$. We can use a
function $f$ to express the transformation of the RNN's recurrent layer:


$$\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}). \tag{9.7.1}$$


In general, the encoder transforms the hidden states at all the time steps into the context variable
through a customized function $q$:


$$\mathbf{c} = q(\mathbf{h}_1, \ldots, \mathbf{h}_T). \tag{9.7.2}$$


For example, when choosing $q(\mathbf{h}_1, \ldots, \mathbf{h}_T) = \mathbf{h}_T$ such as in Fig. 9.7.1, the
context variable is just the hidden state $\mathbf{h}_T$ of the input sequence at the final time step.
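
The roles of $f$ and $q$ can be made concrete with a toy recurrence. In the sketch below everything is an assumption chosen for illustration: features and hidden states are scalars rather than vectors, $f$ is a tanh unit with fixed weights `w_x` and `w_h`, and the function name `encode_sequence` is hypothetical. By default $q$ returns the last hidden state, as in the example above.

```python
import math


def f(x_t, h_prev, w_x=0.5, w_h=0.8):
    """A toy recurrent transformation h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    return math.tanh(w_x * x_t + w_h * h_prev)


def encode_sequence(xs, q=lambda hs: hs[-1]):
    """Run the recurrence over xs and summarize all hidden states with q."""
    h, hs = 0.0, []
    for x_t in xs:
        h = f(x_t, h)
        hs.append(h)
    return q(hs), hs


c_short, hs_short = encode_sequence([1.0, -2.0])
c_long, hs_long = encode_sequence([1.0, -2.0, 0.5, 3.0, -1.0])
print(c_short, c_long)  # one context value per sequence, whatever its length

# A different choice of q, here the average of all hidden states
c_mean, _ = encode_sequence([1.0, -2.0], q=lambda hs: sum(hs) / len(hs))
print(c_mean)
```

Whatever the input length, the context has the same shape, which is what allows a decoder with a fixed-size state to consume it.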


So far we have used a unidirectional RNN to design the encoder, where a hidden state only
depends on the input subsequence at and before the time step of the hidden state. We can also
construct encoders using bidirectional RNNs. In this case, a hidden state depends on the
subsequence before and after the time step (including the input at the current time step), which
encodes the information of the entire sequence.


Now let us implement the RNN encoder. Note that we use an embedding layer to obtain the feature
vector for each token in the input sequence. The weight of an embedding layer is a matrix whose
number of rows equals the size of the input vocabulary (vocab_size) and whose number of columns
equals the feature vector's dimension (embed_size). For any input token index $i$, the embedding
layer fetches the $i^{\mathrm{th}}$ row (starting from 0) of the weight matrix to return its
feature vector. Besides, here we choose a multilayer GRU to implement the encoder.
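
The row lookup performed by an embedding layer can be sketched without any framework. The weight values below are arbitrary placeholders chosen so that each row is easy to recognize; a real embedding layer would learn them.

```python
vocab_size, embed_size = 4, 3
# Toy weight matrix with one row per vocabulary index (values are arbitrary)
weight = [[float(10 * i + j) for j in range(embed_size)]
          for i in range(vocab_size)]

def embed(token_indices):
    """Return the feature vector (a row of `weight`) for each token index."""
    return [weight[i] for i in token_indices]

features = embed([2, 0, 2])
print(features)
# [[20.0, 21.0, 22.0], [0.0, 1.0, 2.0], [20.0, 21.0, 22.0]]
```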


#@save
class Seq2SeqEncoder(d2l.Encoder):
    """The RNN encoder for sequence to sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqEncoder, self).__init__(**kwargs)
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = rnn.GRU(num_hiddens, num_layers, dropout=dropout)

    def forward(self, X, *args):
        # The output `X` shape: (`batch_size`, `num_steps`, `embed_size`)
        X = self.embedding(X)
        # In RNN models, the first axis corresponds to time steps
        X = X.swapaxes(0, 1)
        state = self.rnn.begin_state(batch_size=X.shape[1], ctx=X.ctx)
        output, state = self.rnn(X, state)
        # `output` shape: (`num_steps`, `batch_size`, `num_hiddens`)
        # `state[0]` shape: (`num_layers`, `batch_size`, `num_hiddens`)
        return output, state


The returned variables of recurrent layers have been explained in Section 8.6. Let us still use a
concrete example to illustrate the above encoder implementation. Below we instantiate a two-layer
GRU encoder whose number of hidden units is 16. Given a minibatch of sequence inputs X (batch
size: 4, number of time steps: 7), the hidden states of the last layer at all the time steps
(output returned by the encoder's recurrent layers) are a tensor of shape (number of time steps,
batch size, number of hidden units).


encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16,
                         num_layers=2)
encoder.initialize()
X = np.zeros((4, 7))
output, state = encoder(X)
output.shape


(7, 4, 16)


Since a GRU is employed here, the shape of the multilayer hidden states at the final time step is
(number of hidden layers, batch size, number of hidden units). If an LSTM is used, memory cell
information will also be contained in state.


len(state), state[0].shape


(1, (2, 4, 16))


9.7.2 Decoder


As we just mentioned, the context variable $\mathbf{c}$ of the encoder's output encodes the entire
input sequence $x_1, \ldots, x_T$. Given the output sequence $y_1, y_2, \ldots, y_{T'}$ from the
training dataset, for each time step $t'$ (the symbol differs from the time step $t$ of input
sequences or encoders), the probability of the decoder output $y_{t'}$ is conditional on the
previous output subsequence $y_1, \ldots, y_{t'-1}$ and the context variable $\mathbf{c}$, i.e.,
$P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$.


To model this conditional probability on sequences, we can use another RNN as the decoder. At
any time step $t'$ on the output sequence, the RNN takes the output $y_{t'-1}$ from the previous
time step and the context variable $\mathbf{c}$ as its input, then transforms them and the previous
hidden state $\mathbf{s}_{t'-1}$ into the hidden state $\mathbf{s}_{t'}$ at the current time step.
As a result, we can use a function $g$ to express the transformation of the decoder's hidden layer:


$$\mathbf{s}_{t'} = g(y_{t'-1}, \mathbf{c}, \mathbf{s}_{t'-1}). \qquad (9.7.3)$$


After obtaining the hidden state of the decoder, we can use an output layer and the softmax
operation to compute the conditional probability distribution
$P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$ for the output at time step $t'$.


Following Fig. 9.7.1, when implementing the decoder as follows, we directly use the hidden state 
at the final time step of the encoder to initialize the hidden state of the decoder. This requires 
that the RNN encoder and the RNN decoder have the same number of layers and hidden units. To 
further incorporate the encoded input sequence information, the context variable is concatenated 
with the decoder input at all the time steps. To predict the probability distribution of the output 
token, a fully-connected layer is used to transform the hidden state at the final layer of the RNN 
decoder. 
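
The step of concatenating the context variable with the decoder input at all the time steps can be sketched with nested lists. This is only an illustration of the shapes involved, assuming a time-major layout; the helper name `concat_context` and the toy sizes are made up for the sketch.

```python
def concat_context(X, context):
    """Append the same context vector to the decoder input at every time step.

    X: [num_steps][batch_size][embed_size]
    context: [batch_size][num_hiddens]
    returns: [num_steps][batch_size][embed_size + num_hiddens]
    """
    return [[x_vec + c_vec for x_vec, c_vec in zip(step, context)]
            for step in X]

num_steps, batch_size, embed_size, num_hiddens = 3, 2, 4, 5
X = [[[0.0] * embed_size for _ in range(batch_size)] for _ in range(num_steps)]
context = [[1.0] * num_hiddens for _ in range(batch_size)]
out = concat_context(X, context)
print(len(out), len(out[0]), len(out[0][0]))  # 3 2 9
```

The last dimension grows from `embed_size` to `embed_size + num_hiddens`, which is the input width the decoder's recurrent layer then sees.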


class Seq2SeqDecoder(d2l.Decoder):
    """The RNN decoder for sequence to sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqDecoder, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = rnn.GRU(num_hiddens, num_layers, dropout=dropout)
        self.dense = nn.Dense(vocab_size, flatten=False)

    def init_state(self, enc_outputs, *args):
        return enc_outputs[1]

    def forward(self, X, state):
        # The output `X` shape: (`num_steps`, `batch_size`, `embed_size`)
        X = self.embedding(X).swapaxes(0, 1)
        # `context` shape: (`batch_size`, `num_hiddens`)
        context = state[0][-1]
        # Broadcast `context` so it has the same `num_steps` as `X`
        context = np.broadcast_to(context, (
            X.shape[0], context.shape[0], context.shape[1]))
        X_and_context = np.concatenate((X, context), 2)
        output, state = self.rnn(X_and_context, state)
        output = self.dense(output).swapaxes(0, 1)
        # `output` shape: (`batch_size`, `num_steps`, `vocab_size`)
        # `state[0]` shape: (`num_layers`, `batch_size`, `num_hiddens`)
        return output, state


To illustrate the implemented decoder, below we instantiate it with the same hyperparameters
from the aforementioned encoder. As we can see, the output shape of the decoder becomes (batch
size, number of time steps, vocabulary size), where the last dimension of the tensor stores the
predicted token distribution.


decoder = Seq2SeqDecoder(vocab_size=10, embed_size=8, num_hiddens=16,
                         num_layers=2)
decoder.initialize()
state = decoder.init_state(encoder(X))
output, state = decoder(X, state)
output.shape, len(state), state[0].shape


((4, 7, 10), 1, (2, 4, 16))


To summarize, the layers in the above RNN encoder-decoder model are illustrated in Fig. 9.7.2. 


[Figure: encoder and decoder layer stacks, each with an embedding layer at the bottom, over the sources and the targets.]


Fig. 9.7.2: Layers in an RNN encoder-decoder model. 


9.7.3 Loss Function 


At each time step, the decoder predicts a probability distribution for the output tokens. Similar
to language modeling, we can apply softmax to obtain the distribution and calculate the
cross-entropy loss for optimization. Recall from Section 9.5 that the special padding tokens are
appended to the end of sequences so sequences of varying lengths can be efficiently loaded in
minibatches of the same shape. However, prediction of padding tokens should be excluded from loss
calculations.


To this end, we can use the following sequence_mask function to mask irrelevant entries with zero
values so that later multiplication of any irrelevant prediction with zero equals zero. For
example, if the valid lengths of two sequences excluding padding tokens are one and two,
respectively, the remaining entries after the first one and the first two entries are cleared to
zeros.


X = np.array([[1, 2, 3], [4, 5, 6]])
npx.sequence_mask(X, np.array([1, 2]), True, axis=1)


array([[1., 0., 0.],
       [4., 5., 0.]])

We can also mask all the entries across the last few axes. If you like, you may even specify to
replace such entries with a non-zero value.


X = np.ones((2, 3, 4))
npx.sequence_mask(X, np.array([1, 2]), True, value=-1, axis=1)


array([[[ 1.,  1.,  1.,  1.],
        [-1., -1., -1., -1.],
        [-1., -1., -1., -1.]],

       [[ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.],
        [-1., -1., -1., -1.]]])
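
To see what the masking does independently of MXNet, here is a plain-Python stand-in for the two-dimensional case. It is a sketch of the behavior described above, not the implementation of `npx.sequence_mask`; the function below is a hypothetical helper that happens to share the name.

```python
def sequence_mask(X, valid_len, value=0):
    """Mask entries past each sequence's valid length along axis 1.

    X is a list of sequences (one per batch element); valid_len[i] is the
    number of leading entries of X[i] to keep. Later entries become `value`.
    """
    return [[x if t < n else value for t, x in enumerate(seq)]
            for seq, n in zip(X, valid_len)]

print(sequence_mask([[1, 2, 3], [4, 5, 6]], [1, 2]))
# [[1, 0, 0], [4, 5, 0]]
```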


Now we can extend the softmax cross-entropy loss to allow the masking of irrelevant predictions.
Initially, masks for all the predicted tokens are set to one. Once the valid length is given, the mask
corresponding to any padding token will be cleared to zero. In the end, the loss for all the tokens
will be multiplied by the mask to filter out irrelevant predictions of padding tokens in the loss.


#@save
class MaskedSoftmaxCELoss(gluon.loss.SoftmaxCELoss):
    """The softmax cross-entropy loss with masks."""
    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
    # `label` shape: (`batch_size`, `num_steps`)
    # `valid_len` shape: (`batch_size`,)
    def forward(self, pred, label, valid_len):
        # `weights` shape: (`batch_size`, `num_steps`, 1)
        weights = np.expand_dims(np.ones_like(label), axis=-1)
        weights = npx.sequence_mask(weights, valid_len, True, axis=1)
        return super(MaskedSoftmaxCELoss, self).forward(pred, label, weights)


For a sanity check, we can create three identical sequences. Then we can specify that the valid 
lengths of these sequences are 4, 2, and 0, respectively. As a result, the loss of the first sequence 
should be twice as large as that of the second sequence, while the third sequence should have a 
zero loss. 


loss = MaskedSoftmaxCELoss()
loss(np.ones((3, 4, 10)), np.ones((3, 4)), np.array([4, 2, 0]))


array([2.3025851, 1.1512926, 0.       ])
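
The numbers in this sanity check can be reproduced by hand. With identical logits over 10 classes, softmax is uniform, so every token costs $\log 10 \approx 2.3026$. The plain-Python sketch below assumes that the per-sequence loss is the mean over all time steps of the masked token losses, which is consistent with the values printed above; it is not the Gluon implementation.

```python
import math

def masked_softmax_ce(pred, label, valid_len):
    """pred: [batch][steps][vocab] logits; label: [batch][steps] token indices."""
    losses = []
    for logits_seq, label_seq, n in zip(pred, label, valid_len):
        total = 0.0
        for t, (logits, y) in enumerate(zip(logits_seq, label_seq)):
            log_z = math.log(sum(math.exp(v) for v in logits))
            token_loss = log_z - logits[y]   # -log softmax(logits)[y]
            weight = 1.0 if t < n else 0.0   # zero weight on padding positions
            total += weight * token_loss
        losses.append(total / len(label_seq))  # mean over all time steps
    return losses

pred = [[[1.0] * 10 for _ in range(4)] for _ in range(3)]
label = [[1] * 4 for _ in range(3)]
losses = masked_softmax_ce(pred, label, [4, 2, 0])
print([round(v, 4) for v in losses])  # [2.3026, 1.1513, 0.0]
```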


9.7.4 Training 


In the following training loop, we concatenate the special beginning-of-sequence token and the 
original output sequence excluding the final token as the input to the decoder, as shown in Fig. 
9.7.1. This is called teacher forcing because the original output sequence (token labels) is fed into 
the decoder. Alternatively, we could also feed the predicted token from the previous time step as 
the current input to the decoder. 
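
The construction of the decoder input under teacher forcing can be illustrated on plain lists. The token indices below are hypothetical (1 for "<bos>", 2 for "<eos>", 0 for "<pad>") and the helper name is made up for this sketch.

```python
def teacher_forcing_input(Y, bos):
    """Prepend `bos` to each target sequence and drop its final token."""
    return [[bos] + seq[:-1] for seq in Y]

# Hypothetical token indices: 1 = <bos>, 2 = <eos>, 0 = <pad>
Y = [[5, 6, 7, 2], [8, 2, 0, 0]]
print(teacher_forcing_input(Y, bos=1))
# [[1, 5, 6, 7], [1, 8, 2, 0]]
```

At every time step the decoder input is the ground-truth token from one step earlier, so the label at step $t'$ is never fed in before it has to be predicted.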


#@save
def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
    """Train a model for sequence to sequence."""
    net.initialize(init.Xavier(), force_reinit=True, ctx=device)
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': lr})
    loss = MaskedSoftmaxCELoss()
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[10, num_epochs])
    for epoch in range(num_epochs):
        timer = d2l.Timer()
        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
        for batch in data_iter:
            X, X_valid_len, Y, Y_valid_len = [
                x.as_in_ctx(device) for x in batch]
            bos = np.array(
                [tgt_vocab['<bos>']] * Y.shape[0], ctx=device).reshape(-1, 1)
            dec_input = np.concatenate([bos, Y[:, :-1]], 1)  # Teacher forcing
            with autograd.record():
                Y_hat, _ = net(X, dec_input, X_valid_len)
                l = loss(Y_hat, Y, Y_valid_len)
            l.backward()
            d2l.grad_clipping(net, 1)
            num_tokens = Y_valid_len.sum()
            trainer.step(num_tokens)
            metric.add(l.sum(), num_tokens)
        if (epoch + 1) % 10 == 0:
            animator.add(epoch + 1, (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
          f'tokens/sec on {str(device)}')


Now we can create and train an RNN encoder-decoder model for sequence to sequence learning 
on the machine translation dataset. 


embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1
batch_size, num_steps = 64, 10
lr, num_epochs, device = 0.005, 300, d2l.try_gpu()

train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)
encoder = Seq2SeqEncoder(
    len(src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqDecoder(
    len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
net = d2l.EncoderDecoder(encoder, decoder)
train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)


loss 0.022, 7028.2 tokens/sec on gpu(0)


9.7.5 Prediction


To predict the output sequence token by token, at each decoder time step the predicted token from 
the previous time step is fed into the decoder as an input. Similar to training, at the initial time 
step the beginning-of-sequence (“<bos>”) token is fed into the decoder. This prediction process is 
illustrated in Fig. 9.7.3. When the end-of-sequence (“<eos>”) token is predicted, the prediction of 
the output sequence is complete. 


[Figure: the encoder reads "They are watching . <eos>"; starting from "<bos>", the decoder emits "Ils regardent . <eos>" token by token.]


Fig. 9.7.3: Predicting the output sequence token by token using an RNN encoder-decoder. 


We will introduce different strategies for sequence generation in Section 9.8. 


#@save
def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
                    device, save_attention_weights=False):
    """Predict for sequence to sequence."""
    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
        src_vocab['<eos>']]
    enc_valid_len = np.array([len(src_tokens)], ctx=device)
    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
    # Add the batch axis
    enc_X = np.expand_dims(np.array(src_tokens, ctx=device), axis=0)
    enc_outputs = net.encoder(enc_X, enc_valid_len)
    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
    # Add the batch axis
    dec_X = np.expand_dims(np.array([tgt_vocab['<bos>']], ctx=device), axis=0)
    output_seq, attention_weight_seq = [], []
    for _ in range(num_steps):
        Y, dec_state = net.decoder(dec_X, dec_state)
        # We use the token with the highest prediction likelihood as the input
        # of the decoder at the next time step
        dec_X = Y.argmax(axis=2)
        pred = dec_X.squeeze(axis=0).astype('int32').item()
        # Save attention weights (to be covered later)
        if save_attention_weights:
            attention_weight_seq.append(net.decoder.attention_weights)
        # Once the end-of-sequence token is predicted, the generation of the
        # output sequence is complete
        if pred == tgt_vocab['<eos>']:
            break
        output_seq.append(pred)
    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq


9.7.6 Evaluation of Predicted Sequences 


We can evaluate a predicted sequence by comparing it with the label sequence (the ground truth).
BLEU (Bilingual Evaluation Understudy), though originally proposed for evaluating machine
translation results (Papineni et al., 2002), has been extensively used in measuring the quality
of output sequences for different applications. In principle, for any n-gram in the predicted
sequence, BLEU evaluates whether this n-gram appears in the label sequence.


Denote by $p_n$ the precision of n-grams, which is the ratio of the number of matched n-grams in
the predicted and label sequences to the number of n-grams in the predicted sequence. To explain,
given a label sequence A, B, C, D, E, F, and a predicted sequence A, B, B, C, D, we have
$p_1 = 4/5$, $p_2 = 3/4$, $p_3 = 1/3$, and $p_4 = 0$. Besides, let $\mathrm{len}_{\text{label}}$
and $\mathrm{len}_{\text{pred}}$ be the numbers of tokens in the label sequence and the predicted
sequence, respectively. Then, BLEU is defined as


$$\exp\left(\min\left(0, 1 - \frac{\mathrm{len}_{\text{label}}}{\mathrm{len}_{\text{pred}}}\right)\right) \prod_{n=1}^{k} p_n^{1/2^n}, \qquad (9.7.4)$$


where $k$ is the longest n-gram for matching.
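
The precisions quoted for the example sequences A, B, C, D, E, F (label) and A, B, B, C, D (prediction) can be checked with a short sketch. The helper `ngram_precision` is a hypothetical name introduced here; it returns the numerator and denominator of $p_n$ separately so that the fractions are easy to compare with the text.

```python
import collections

def ngram_precision(pred_tokens, label_tokens, n):
    """Return (matched n-grams, n-grams in the prediction), i.e. p_n as a pair."""
    label_counts = collections.Counter(
        tuple(label_tokens[i:i + n]) for i in range(len(label_tokens) - n + 1))
    num_pred = len(pred_tokens) - n + 1
    num_matches = 0
    for i in range(num_pred):
        gram = tuple(pred_tokens[i:i + n])
        if label_counts[gram] > 0:
            num_matches += 1
            label_counts[gram] -= 1  # each label n-gram can be matched only once
    return num_matches, num_pred

label = ['A', 'B', 'C', 'D', 'E', 'F']
pred = ['A', 'B', 'B', 'C', 'D']
for n in range(1, 5):
    print(n, ngram_precision(pred, label, n))
# 1 (4, 5)
# 2 (3, 4)
# 3 (1, 3)
# 4 (0, 2)
```

The second B in the prediction does not count as a unigram match because the label contains only one B.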


Based on the definition of BLEU in (9.7.4), whenever the predicted sequence is the same as the
label sequence, BLEU is 1. Moreover, since matching longer n-grams is more difficult, BLEU assigns
a greater weight to a longer n-gram precision. Specifically, when $p_n$ is fixed, $p_n^{1/2^n}$
increases as $n$ grows (the original paper uses $p_n^{1/n}$). Furthermore, since predicting
shorter sequences tends to obtain a higher $p_n$ value, the coefficient before the multiplication
term in (9.7.4) penalizes shorter predicted sequences. For example, when $k = 2$, given the label
sequence A, B, C, D, E, F and the predicted sequence A, B, although $p_1 = p_2 = 1$, the penalty
factor $\exp(1 - 6/2) \approx 0.14$ lowers the BLEU.
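
Both effects are easy to verify numerically. In the sketch below, the precision value `p = 0.5` is an arbitrary choice made for illustration; the sequence lengths 6 and 2 are those of the example above.

```python
import math

p = 0.5  # a fixed, hypothetical precision value
weights = [p ** (1 / 2 ** n) for n in range(1, 5)]
print([round(w, 3) for w in weights])
# [0.707, 0.841, 0.917, 0.958]

len_label, len_pred = 6, 2
penalty = math.exp(min(0, 1 - len_label / len_pred))
print(round(penalty, 3))  # 0.135
```

For a fixed precision below 1, the factor moves closer to 1 as $n$ grows, so a given precision on longer n-grams reduces the score less.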


We implement the BLEU measure as follows. 


def bleu(pred_seq, label_seq, k):  #@save
    """Compute the BLEU."""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        for i in range(len_pred - n + 1):
            # Use the same key for lookup and decrement so that each label
            # n-gram is matched at most once
            if label_subs[' '.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[' '.join(pred_tokens[i: i + n])] -= 1
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score


In the end, we use the trained RNN encoder-decoder to translate a few English sentences into
French and compute the BLEU of the results.


engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
for eng, fra in zip(engs, fras):
    translation, attention_weight_seq = predict_seq2seq(
        net, eng, src_vocab, tgt_vocab, num_steps, device)
    print(f'{eng} => {translation}, bleu {bleu(translation, fra, k=2):.3f}')


go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
he's calm . => il est malade ., bleu 0.658
i'm home . => je suis chez <unk> de la partie ., bleu 0.517


Summary 


• Following the design of the encoder-decoder architecture, we can use two RNNs to design a
model for sequence to sequence learning.

• When implementing the encoder and the decoder, we can use multilayer RNNs.

• We can use masks to filter out irrelevant computations, such as when calculating the loss.

• In encoder-decoder training, the teacher forcing approach feeds original output sequences
(in contrast to predictions) into the decoder.

• BLEU is a popular measure for evaluating output sequences by matching n-grams between
the predicted sequence and the label sequence.


Exercises 


1. Can you adjust the hyperparameters to improve the translation results?

2. Rerun the experiment without using masks in the loss calculation. What results do you
observe? Why?

3. If the encoder and the decoder differ in the number of layers or the number of hidden units,
how can we initialize the hidden state of the decoder?

4. In training, replace teacher forcing with feeding the prediction at the previous time step into
the decoder. How does this influence the performance?

5. Rerun the experiment by replacing GRU with LSTM.

6. Are there any other ways to design the output layer of the decoder?


Discussions


9.8 Beam Search 


In Section 9.7, we predicted the output sequence token by token until the special end-of-sequence 
“<eos>” token is predicted. In this section, we will begin with formalizing this greedy search strategy 
and exploring issues with it, then compare this strategy with other alternatives: exhaustive search 
and beam search. 


Before a formal introduction to greedy search, let us formalize the search problem using the same
mathematical notation from Section 9.7. At any time step $t'$, the probability of the decoder
output $y_{t'}$ is conditional on the output subsequence $y_1, \ldots, y_{t'-1}$ before $t'$ and
the context variable $\mathbf{c}$ that encodes the information of the input sequence. To quantify
computational cost, denote by $\mathcal{Y}$ (it contains “<eos>”) the output vocabulary. So the
cardinality $|\mathcal{Y}|$ of this vocabulary set is the vocabulary size. Let us also specify the
maximum number of tokens of an output sequence as $T'$. As a result, our goal is to search for an
ideal output from all the $\mathcal{O}(|\mathcal{Y}|^{T'})$ possible output sequences. Of course,
for all these output sequences, portions including and after “<eos>” will be discarded in the
actual output.


9.8.1 Greedy Search 


First, let us take a look at a simple strategy: greedy search. This strategy has been used to predict
sequences in Section 9.7. In greedy search, at any time step $t'$ of the output sequence, we search
for the token with the highest conditional probability from $\mathcal{Y}$, i.e.,


$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c}), \qquad (9.8.1)$$


as the output. Once “<eos>” is outputted or the output sequence has reached its maximum length
$T'$, the output sequence is completed.


So what can go wrong with greedy search? In fact, the optimal sequence should be the output
sequence with the maximum $\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$,
which is the conditional probability of generating an output sequence based on the input sequence.
Unfortunately, there is no guarantee that the optimal sequence will be obtained by greedy search.


[Figure: conditional probabilities of the four tokens at each time step.]


Fig. 9.8.1: At each time step, greedy search selects the token with the highest conditional
probability.


Footnote: discussions for Section 9.7 are at https://discuss.d2l.ai/t/345


Let us illustrate it with an example. Suppose that there are four tokens “A”, “B”, “C”, and “<eos>”
in the output dictionary. In Fig. 9.8.1, the four numbers under each time step represent the
conditional probabilities of generating “A”, “B”, “C”, and “<eos>” at that time step, respectively.
At each time step, greedy search selects the token with the highest conditional probability.
Therefore, the output sequence “A”, “B”, “C”, and “<eos>” will be predicted in Fig. 9.8.1. The
conditional probability of this output sequence is $0.5 \times 0.4 \times 0.4 \times 0.6 = 0.048$.


[Figure: conditional probabilities of the four tokens at each time step, with “C” selected at time step 2.]


Fig. 9.8.2: The four numbers under each time step represent the conditional probabilities of gen- 
erating “A”, “B”, “C”, and “<eos>” at that time step. At time step 2, the token “C”, which has the 
second highest conditional probability, is selected. 


Next, let us look at another example in Fig. 9.8.2. Unlike in Fig. 9.8.1, at time step 2 we select
the token “C” in Fig. 9.8.2, which has the second highest conditional probability. Since the
output subsequences at time steps 1 and 2, on which time step 3 is based, have changed from “A”
and “B” in Fig. 9.8.1 to “A” and “C” in Fig. 9.8.2, the conditional probability of each token at time
step 3 has also changed in Fig. 9.8.2. Suppose that we choose the token “B” at time step 3. Now
time step 4 is conditional on the output subsequence at the first three time steps “A”, “C”, and
“B”, which is different from “A”, “B”, and “C” in Fig. 9.8.1. Therefore, the conditional probability
of generating each token at time step 4 in Fig. 9.8.2 is also different from that in Fig. 9.8.1. As a
result, the conditional probability of the output sequence “A”, “C”, “B”, and “<eos>” in Fig. 9.8.2
is $0.5 \times 0.3 \times 0.6 \times 0.6 = 0.054$, which is greater than that of greedy search in
Fig. 9.8.1. In this example, the output sequence “A”, “B”, “C”, and “<eos>” obtained by the greedy
search is not an optimal sequence.
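
The two examples can be replayed with a toy model stored as a lookup table from output prefix to conditional distribution. The probabilities named in the text (0.5, 0.4, 0.3, 0.4, 0.6, and 0.6) are kept; every other entry in `MODEL` is an assumption filled in only so that each distribution sums to one.

```python
import math

EOS = '<eos>'
# P(y | prefix) for the prefixes visited in the two examples.
# Only the probabilities quoted in the text are given; the rest are assumed.
MODEL = {
    (): {'A': 0.5, 'B': 0.2, 'C': 0.2, EOS: 0.1},
    ('A',): {'A': 0.1, 'B': 0.4, 'C': 0.3, EOS: 0.2},
    ('A', 'B'): {'A': 0.2, 'B': 0.2, 'C': 0.4, EOS: 0.2},
    ('A', 'B', 'C'): {'A': 0.0, 'B': 0.2, 'C': 0.2, EOS: 0.6},
    ('A', 'C'): {'A': 0.1, 'B': 0.6, 'C': 0.2, EOS: 0.1},
    ('A', 'C', 'B'): {'A': 0.0, 'B': 0.2, 'C': 0.2, EOS: 0.6},
}

def greedy_search(model, max_len):
    """Pick the most probable token at every step until <eos> or max_len."""
    seq = ()
    for _ in range(max_len):
        dist = model[seq]
        token = max(dist, key=dist.get)
        seq += (token,)
        if token == EOS:
            break
    return seq

def sequence_prob(model, seq):
    """Product of the conditional probabilities along `seq`."""
    return math.prod(model[seq[:t]][seq[t]] for t in range(len(seq)))

greedy = greedy_search(MODEL, max_len=4)
print(greedy, round(sequence_prob(MODEL, greedy), 3))
# ('A', 'B', 'C', '<eos>') 0.048
print(round(sequence_prob(MODEL, ('A', 'C', 'B', EOS)), 3))  # 0.054
```

Choosing the locally second-best token at time step 2 changes every later distribution, which is how the alternative sequence ends up with the higher overall probability.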


9.8.2 Exhaustive Search 


If the goal is to obtain the optimal sequence, we may consider using exhaustive search: exhaustively 
enumerate all the possible output sequences with their conditional probabilities, then output the 
one with the highest conditional probability. 


Although we can use exhaustive search to obtain the optimal sequence, its computational cost
$\mathcal{O}(|\mathcal{Y}|^{T'})$ is likely to be excessively high. For example, when
$|\mathcal{Y}| = 10000$ and $T' = 10$, we will need to evaluate $10000^{10} = 10^{40}$ sequences.
This is next to impossible! On the other hand, the computational cost of greedy search is
$\mathcal{O}(|\mathcal{Y}| T')$: it is usually significantly smaller than that of exhaustive
search. For example, when $|\mathcal{Y}| = 10000$ and $T' = 10$, we only need to evaluate
$10000 \times 10 = 10^5$ sequences.
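
The two counts can be confirmed with exact integer arithmetic; the variable names are chosen only for this check.

```python
vocab_size, max_len = 10000, 10
exhaustive = vocab_size ** max_len  # every possible output sequence
greedy = vocab_size * max_len       # one pass over the vocabulary per time step
print(exhaustive == 10 ** 40, greedy == 10 ** 5)  # True True
```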







9.8.3 Beam Search 


Decisions about sequence searching strategies lie on a spectrum, with easy questions at either 
extreme. What if only accuracy matters? Obviously, exhaustive search. What if only computa- 
tional cost matters? Clearly, greedy search. A real-world application usually asks a complicated 
question, somewhere in between those two extremes. 


Beam search is an improved version of greedy search. It has a hyperparameter named beam size, $k$.
At time step 1, we select $k$ tokens with the highest conditional probabilities. Each of them will
be the first token of $k$ candidate output sequences, respectively. At each subsequent time step,
based on the $k$ candidate output sequences at the previous time step, we continue to select $k$
candidate output sequences with the highest conditional probabilities from $k|\mathcal{Y}|$
possible choices.


[Figure: beam search over time steps 1 to 3, showing the candidates kept at each time step.]


Fig. 9.8.3: The process of beam search (beam size: 2, maximum length of an output sequence: 3). 
The candidate output sequences are A, C, AB, CE, ABD, and CED. 


Fig. 9.8.3 demonstrates the process of beam search with an example. Suppose that the output
vocabulary contains only five elements: $\mathcal{Y} = \{A, B, C, D, E\}$, where one of them is
“<eos>”. Let the beam size be 2 and the maximum length of an output sequence be 3. At time step 1,
suppose that the tokens with the highest conditional probabilities $P(y_1 \mid \mathbf{c})$ are
$A$ and $C$. At time step 2, for all $y_2 \in \mathcal{Y}$, we compute


$$\begin{aligned} P(A, y_2 \mid \mathbf{c}) &= P(A \mid \mathbf{c}) P(y_2 \mid A, \mathbf{c}), \\ P(C, y_2 \mid \mathbf{c}) &= P(C \mid \mathbf{c}) P(y_2 \mid C, \mathbf{c}), \end{aligned} \qquad (9.8.2)$$


and pick the largest two among these ten values, say $P(A, B \mid \mathbf{c})$ and
$P(C, E \mid \mathbf{c})$. Then at time step 3, for all $y_3 \in \mathcal{Y}$, we compute


$$\begin{aligned} P(A, B, y_3 \mid \mathbf{c}) &= P(A, B \mid \mathbf{c}) P(y_3 \mid A, B, \mathbf{c}), \\ P(C, E, y_3 \mid \mathbf{c}) &= P(C, E \mid \mathbf{c}) P(y_3 \mid C, E, \mathbf{c}), \end{aligned} \qquad (9.8.3)$$


and pick the largest two among these ten values, say $P(A, B, D \mid \mathbf{c})$ and
$P(C, E, D \mid \mathbf{c})$. As a result, we get six candidate output sequences: (i) A; (ii) C;
(iii) A, B; (iv) C, E; (v) A, B, D; and (vi) C, E, D.


In the end, we obtain the set of final candidate output sequences based on these six sequences
(e.g., discard portions including and after “<eos>”). Then we choose the sequence with the highest
value of the following score as the output sequence:


$$\frac{1}{L^\alpha} \log P(y_1, \ldots, y_L \mid \mathbf{c}) = \frac{1}{L^\alpha} \sum_{t'=1}^{L} \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}), \qquad (9.8.4)$$


where $L$ is the length of the final candidate sequence and $\alpha$ is usually set to 0.75. Since
a longer sequence has more logarithmic terms in the summation of (9.8.4), the term $L^\alpha$ in
the denominator penalizes long sequences.
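
The procedure can be sketched in plain Python on a small toy model. Several things here are assumptions made for illustration: the probability table `MODEL` (with a uniform fallback for unlisted prefixes), the function names, and two simplifications relative to the description above: a candidate that ends in “<eos>” is not expanded further, and only candidates that end in “<eos>” or reach the maximum length are scored as final candidates. The length $L$ in the score counts the “<eos>” token.

```python
import math

EOS = '<eos>'
# Toy conditional distributions P(y | prefix); all numbers are illustrative.
MODEL = {
    (): {'A': 0.5, 'B': 0.2, 'C': 0.2, EOS: 0.1},
    ('A',): {'A': 0.1, 'B': 0.4, 'C': 0.3, EOS: 0.2},
    ('A', 'B'): {'A': 0.2, 'B': 0.2, 'C': 0.4, EOS: 0.2},
    ('A', 'B', 'C'): {'A': 0.0, 'B': 0.2, 'C': 0.2, EOS: 0.6},
    ('A', 'C'): {'A': 0.1, 'B': 0.6, 'C': 0.2, EOS: 0.1},
    ('A', 'C', 'B'): {'A': 0.0, 'B': 0.2, 'C': 0.2, EOS: 0.6},
}
UNIFORM = {'A': 0.25, 'B': 0.25, 'C': 0.25, EOS: 0.25}

def cond_prob(prefix):
    """Return P(y | prefix), falling back to a uniform guess if unlisted."""
    return MODEL.get(prefix, UNIFORM)

def beam_search(cond_prob, beam_size, max_len, alpha=0.75):
    beams = [((), 0.0)]  # (prefix, log-probability of the prefix)
    finals = []          # finished candidates
    for step in range(max_len):
        # Expand every kept candidate with every token of nonzero probability
        expansions = [
            (prefix + (token,), logp + math.log(p))
            for prefix, logp in beams
            for token, p in cond_prob(prefix).items() if p > 0]
        expansions.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for seq, logp in expansions[:beam_size]:
            done = seq[-1] == EOS or step == max_len - 1
            (finals if done else beams).append((seq, logp))
        if not beams:
            break
    # Length-normalized score from (9.8.4)
    return max(finals, key=lambda cand: cand[1] / len(cand[0]) ** alpha)

for k in (1, 2):
    seq, logp = beam_search(cond_prob, beam_size=k, max_len=4)
    print(k, seq, round(math.exp(logp), 3))
# 1 ('A', 'B', 'C', '<eos>') 0.048
# 2 ('A', 'C', 'B', '<eos>') 0.054
```

On this toy model a beam size of 1 reproduces greedy search, while a beam size of 2 keeps the prefix “A”, “C” alive long enough to find the more probable sequence.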


The computational cost of beam search is $\mathcal{O}(k|\mathcal{Y}|T')$. This result is in between
that of greedy search and that of exhaustive search. In fact, greedy search can be treated as a
special type of beam search with a beam size of 1. With a flexible choice of the beam size, beam
search provides a tradeoff between accuracy and computational cost.


Summary 


• Sequence searching strategies include greedy search, exhaustive search, and beam search.

• Beam search provides a tradeoff between accuracy and computational cost via its flexible
choice of the beam size.


Exercises 


1. Can we treat exhaustive search as a special type of beam search? Why or why not? 


2. Apply beam search in the machine translation problem in Section 9.7. How does the beam 
size affect the translation results and the prediction speed? 


3. We used language modeling for generating text following user-provided prefixes in Section 
8.5. Which kind of search strategy does it use? Can you improve it? 


Discussions: https://discuss.d2l.ai/t/338


10 Attention Mechanisms


The optic nerve of a primate's visual system receives massive sensory input, far exceeding what
the brain can fully process. Fortunately, not all stimuli are created equal. Focalization and
concentration of consciousness have enabled primates to direct attention to objects of interest,
such as prey and predators, in the complex visual environment. The ability to pay attention to only
a small fraction of the information has evolutionary significance, allowing human beings to live
and succeed.


Scientists have been studying attention in the cognitive neuroscience field since the 19th century.
In this chapter, we will begin by reviewing a popular framework explaining how attention is
deployed in a visual scene. Inspired by the attention cues in this framework, we will design models
that leverage such attention cues. Notably, the Nadaraya-Watson kernel regression in 1964 is a
simple demonstration of machine learning with attention mechanisms.


Next, we will go on to introduce attention functions that have been extensively used in the design 
of attention models in deep learning. Specifically, we will show how to use these functions to 
design the Bahdanau attention, a groundbreaking attention model in deep learning that can align 
bidirectionally and is differentiable. 


In the end, equipped with the more recent multi-head attention and self-attention designs, we will 
describe the Transformer architecture based solely on attention mechanisms. Since their proposal 
in 2017, Transformers have been pervasive in modern deep learning applications, such as in areas 
of language, vision, speech, and reinforcement learning. 


10.1 Attention Cues 


Thank you for your attention to this book. Attention is a scarce resource: at the moment you 
are reading this book and ignoring the rest. Thus, similar to money, your attention is being paid 
with an opportunity cost. To ensure that your investment of attention right now is worthwhile, we 
have been highly motivated to pay our attention carefully to produce a nice book. Attention is the 
keystone in the arch of life and holds the key to any work's exceptionalism. 


Since economics studies the allocation of scarce resources, we are in the era of the attention econ- 
omy, where human attention is treated as a limited, valuable, and scarce commodity that can be 
exchanged. Numerous business models have been developed to capitalize on it. On music or video 
streaming services, we either pay attention to their ads or pay money to hide them. For growth 
in the world of online games, we either pay attention to participate in battles, which attract new 
gamers, or pay money to instantly become powerful. Nothing comes for free. 


All in all, information in our environment is not scarce; attention is. When inspecting a visual
scene, our optic nerve receives information on the order of $10^8$ bits per second, far exceeding
what our brain can fully process. Fortunately, our ancestors had learned from experience (also
known as data) that not all sensory inputs are created equal. Throughout human history, the
capability of directing attention to only a fraction of information of interest has enabled our
brain to allocate resources more smartly to survive, to grow, and to socialize, such as detecting
predators, prey, and mates.


10.1.1 Attention Cues in Biology 


To explain how our attention is deployed in the visual world, a two-component framework has 
emerged and been pervasive. This idea dates back to William James in the 1890s, who is consid- 
ered the “father of American psychology” (James, 2007). In this framework, subjects selectively 
direct the spotlight of attention using both the nonvolitional cue and volitional cue. 


The nonvolitional cue is based on the saliency and conspicuity of objects in the environment. 
Imagine there are five objects in front of you: a newspaper, a research paper, a cup of coffee, a 
notebook, and a book such as in Fig. 10.1.1. While all the paper products are printed in black and 
white, the coffee cup is red. In other words, this coffee is intrinsically salient and conspicuous 
in this visual environment, automatically and involuntarily drawing attention. So you bring the 
fovea (the center of the macula where visual acuity is highest) onto the coffee as shown in Fig. 
10.1.1. 







Fig. 10.1.1: Using the nonvolitional cue based on saliency (red cup, non-paper), attention is invol- 
untarily directed to the coffee. 


After drinking coffee, you become caffeinated and want to read a book. So you turn your head,
refocus your eyes, and look at the book as depicted in Fig. 10.1.2. Different from the case in Fig.
10.1.1 where the coffee biases you towards selecting based on saliency, in this task-dependent
case you select the book under cognitive and volitional control. Using the volitional cue based on
variable selection criteria, this form of attention is more deliberate. It is also more powerful with
the subject's voluntary effort.


Fig. 10.1.2: Using the volitional cue (want to read a book) that is task-dependent, attention is
directed to the book under volitional control.


10.1.2 Queries, Keys, and Values 


Inspired by the nonvolitional and volitional attention cues that explain the attentional deploy- 
ment, in the following we will describe a framework for designing attention mechanisms by in- 
corporating these two attention cues. 


To begin with, consider the simpler case where only nonvolitional cues are available. To bias 
selection over sensory inputs, we can simply use a parameterized fully-connected layer or even 
non-parameterized max or average pooling. 


Therefore, what sets attention mechanisms apart from those fully-connected layers or pooling 
layers is the inclusion of the volitional cues. In the context of attention mechanisms, we refer 
to volitional cues as queries. Given any query, attention mechanisms bias selection over sensory 
inputs (e.g., intermediate feature representations) via attention pooling. These sensory inputs are 
called values in the context of attention mechanisms. More generally, every value is paired with a 
key, which can be thought of as the nonvolitional cue of that sensory input. As shown in Fig. 10.1.3, 
we can design attention pooling so that the given query (volitional cue) can interact with keys 
(nonvolitional cues), which guides the biased selection over values (sensory inputs). 


Fig. 10.1.3: Attention mechanisms bias selection over values (sensory inputs) via attention pooling, 
which incorporates queries (volitional cues) and keys (nonvolitional cues). 


Note that there are many alternatives for the design of attention mechanisms. For instance, we 
can design a non-differentiable attention model that can be trained using reinforcement learning 
methods (Mnih et al., 2014). Given the dominance of the framework in Fig. 10.1.3, models under 
this framework will be the center of our attention in this chapter. 


10.1.3 Visualization of Attention 


Average pooling can be treated as a weighted average of inputs, where weights are uniform. In 
practice, attention pooling aggregates values using weighted average, where weights are computed 
between the given query and different keys. 
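
The contrast between the two kinds of pooling can be made concrete with a small sketch. It uses plain Python rather than MXNet, and the attention weights are made-up numbers chosen only for illustration; in a real model they would be computed from a query and the keys.

```python
def weighted_average(weights, values):
    """Aggregate `values` with `weights` that are assumed to sum to one."""
    return sum(w * v for w, v in zip(weights, values))

values = [1.0, 2.0, 6.0]

# Average pooling: every input gets the same weight 1/n.
uniform = [1.0 / len(values)] * len(values)
avg = weighted_average(uniform, values)

# Attention pooling: the weights differ per input. These numbers are
# invented for illustration, not produced by a trained model.
attn = [0.1, 0.1, 0.8]
pooled = weighted_average(attn, values)
```

With uniform weights the result is the plain mean of the values, while the non-uniform weights pull the result towards the value that receives the most attention.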


from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()


To visualize attention weights, we define the show_heatmaps function. Its input matrices has the 
shape (number of rows for display, number of columns for display, number of queries, number of 
keys). 


#@save
def show_heatmaps(matrices, xlabel, ylabel, titles=None, figsize=(2.5, 2.5),
                  cmap='Reds'):
    d2l.use_svg_display()
    num_rows, num_cols = matrices.shape[0], matrices.shape[1]
    fig, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize,
                                 sharex=True, sharey=True, squeeze=False)
    for i, (row_axes, row_matrices) in enumerate(zip(axes, matrices)):
        for j, (ax, matrix) in enumerate(zip(row_axes, row_matrices)):
            pcm = ax.imshow(matrix.asnumpy(), cmap=cmap)
            if i == num_rows - 1:
                ax.set_xlabel(xlabel)
            if j == 0:
                ax.set_ylabel(ylabel)
            if titles:
                ax.set_title(titles[j])
    fig.colorbar(pcm, ax=axes, shrink=0.6)


For demonstration, we consider a simple case where the attention weight is one only when the 
query and the key are the same; otherwise it is zero. 


attention_weights = np.eye(10).reshape((1, 1, 10, 10)) 
show_heatmaps(attention_weights, xlabel='Keys', ylabel='Queries') 




In the subsequent sections, we will often invoke this function to visualize attention weights. 


Summary 


• Human attention is a limited, valuable, and scarce resource. 

• Subjects selectively direct attention using both the nonvolitional and volitional cues. The 
  former is based on saliency and the latter is task-dependent. 

• Attention mechanisms are different from fully-connected layers or pooling layers due to 
  inclusion of the volitional cues. 

• Attention mechanisms bias selection over values (sensory inputs) via attention pooling, 
  which incorporates queries (volitional cues) and keys (nonvolitional cues). Keys and values 
  are paired. 

• We can visualize attention weights between queries and keys. 


Exercises 


1. What can be the volitional cue when decoding a sequence token by token in machine trans- 
lation? What are the nonvolitional cues and the sensory inputs? 


2. Randomly generate a 10 x 10 matrix and use the softmax operation to ensure each row is a 
valid probability distribution. Visualize the output attention weights. 


Discussions 


10.2 Attention Pooling: Nadaraya-Watson Kernel Regression 


Now you know the major components of attention mechanisms under the framework in Fig. 
10.1.3. To recapitulate, the interactions between queries (volitional cues) and keys (nonvolitional 
cues) result in attention pooling. The attention pooling selectively aggregates values (sensory in- 
puts) to produce the output. In this section, we will describe attention pooling in greater detail 
to give you a high-level view of how attention mechanisms work in practice. Specifically, the 
Nadaraya-Watson kernel regression model proposed in 1964 is a simple yet complete example for 
demonstrating machine learning with attention mechanisms. 


from d2l import mxnet as d2l
from mxnet import autograd, gluon, np, npx
from mxnet.gluon import nn

npx.set_np()


10.2.1 Generating the Dataset 


To keep things simple, let us consider the following regression problem: given a dataset of 
input-output pairs $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, how to learn $f$ to predict the output 
$\hat{y} = f(x)$ for any new input $x$? 


Here we generate an artificial dataset according to the following nonlinear function with the noise 
term $\epsilon$: 


$$y_i = 2\sin(x_i) + x_i^{0.8} + \epsilon, \tag{10.2.1}$$


where $\epsilon$ obeys a normal distribution with zero mean and standard deviation 0.5. Both 50 training 
examples and 50 testing examples are generated. To better visualize the pattern of attention later, 
the training inputs are sorted. 


n_train = 50  # No. of training examples
x_train = np.sort(np.random.rand(n_train) * 5)  # Training inputs

def f(x):
    return 2 * np.sin(x) + x**0.8

y_train = f(x_train) + np.random.normal(0.0, 0.5, (n_train,))  # Training outputs
x_test = np.arange(0, 5, 0.1)  # Testing examples
y_truth = f(x_test)  # Ground-truth outputs for the testing examples
n_test = len(x_test)  # No. of testing examples
n_test


50 


The following function plots all the training examples (represented by circles), the ground-truth 
data generation function f without the noise term (labeled by “Truth”), and the learned prediction 
function (labeled by “Pred”). 


def plot_kernel_reg(y_hat):
    d2l.plot(x_test, [y_truth, y_hat], 'x', 'y', legend=['Truth', 'Pred'],
             xlim=[0, 5], ylim=[-1, 5])
    d2l.plt.plot(x_train, y_train, 'o', alpha=0.5);


10.2.2 Average Pooling 


We begin with perhaps the world's “dumbest” estimator for this regression problem: using average 
pooling to average over all the training outputs: 


$$f(x) = \frac{1}{n}\sum_{i=1}^n y_i, \tag{10.2.2}$$


which is plotted below. As we can see, this estimator is indeed not so smart. 


y_hat = y_train.mean().repeat(n_test)
plot_kernel_reg(y_hat)


10.2.3 Nonparametric Attention Pooling 


Obviously, average pooling omits the inputs $x_i$. A better idea was proposed by Nadaraya 
(Nadaraya, 1964) and Watson (Watson, 1964) to weigh the outputs $y_i$ according to their input 
locations: 


$$f(x) = \sum_{i=1}^n \frac{K(x - x_i)}{\sum_{j=1}^n K(x - x_j)} y_i, \tag{10.2.3}$$


where $K$ is a kernel. The estimator in (10.2.3) is called Nadaraya-Watson kernel regression. Here we 
will not dive into details of kernels. Recall the framework of attention mechanisms in Fig. 10.1.3. 
From the perspective of attention, we can rewrite (10.2.3) in a more generalized form of attention 
pooling: 


$$f(x) = \sum_{i=1}^n \alpha(x, x_i) y_i, \tag{10.2.4}$$


where $x$ is the query and $(x_i, y_i)$ is the key-value pair. Comparing (10.2.4) and (10.2.2), the 
attention pooling here is a weighted average of values $y_i$. The attention weight $\alpha(x, x_i)$ in 
(10.2.4) is assigned to the corresponding value $y_i$ based on the interaction between the query $x$ 
and the key $x_i$ modeled by $\alpha$. For any query, its attention weights over all the key-value pairs 
are a valid probability distribution: they are non-negative and sum up to one. 

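To gain intuitions of attention pooling, just consider a Gaussian kernel defined as 


$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right). \tag{10.2.5}$$


Plugging the Gaussian kernel into (10.2.4) and (10.2.3) gives 


$$\begin{aligned}
f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\
&= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}(x - x_i)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}(x - x_j)^2\right)} y_i \\
&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}(x - x_i)^2\right) y_i.
\end{aligned} \tag{10.2.6}$$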


In (10.2.6), a key $x_i$ that is closer to the given query $x$ will get more attention via a larger 
attention weight assigned to the key's corresponding value $y_i$. 


Notably, Nadaraya-Watson kernel regression is a nonparametric model; thus (10.2.6) is an exam- 
ple of nonparametric attention pooling. In the following, we plot the prediction based on this non- 
parametric attention model. The predicted line is smooth and closer to the ground-truth than that 
produced by average pooling. 
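
The softmax form in (10.2.6) can also be checked by hand on a few points. The sketch below uses plain Python instead of MXNet, and the toy keys and values are invented for illustration; `nw_predict` is a hypothetical helper name, not part of the book's code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nw_predict(query, keys, values):
    """Nadaraya-Watson prediction with a Gaussian kernel, as in (10.2.6)."""
    weights = softmax([-0.5 * (query - k) ** 2 for k in keys])
    return sum(w * v for w, v in zip(weights, values)), weights

keys = [0.0, 1.0, 2.0, 3.0]     # toy training inputs (made up)
values = [0.0, 1.0, 4.0, 9.0]   # toy training outputs (made up)
y_hat, weights = nw_predict(1.1, keys, values)

# The same weights obtained directly from the kernel ratio in (10.2.3)
kernel = [math.exp(-0.5 * (1.1 - k) ** 2) / math.sqrt(2 * math.pi)
          for k in keys]
direct = [k / sum(kernel) for k in kernel]
```

The normalizing constant of the Gaussian kernel cancels in the ratio, which is why the kernel form and the softmax form give the same weights, and the key closest to the query receives the largest weight.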


# Shape of `X_repeat`: (`n_test`, `n_train`), where each row contains the
# same testing inputs (i.e., same queries)
X_repeat = x_test.repeat(n_train).reshape((-1, n_train))
# Note that `x_train` contains the keys. Shape of `attention_weights`:
# (`n_test`, `n_train`), where each row contains attention weights to be
# assigned among the values (`y_train`) given each query
attention_weights = npx.softmax(-(X_repeat - x_train)**2 / 2)
# Each element of `y_hat` is weighted average of values, where weights are
# attention weights
y_hat = np.dot(attention_weights, y_train)
plot_kernel_reg(y_hat)





Now let us take a look at the attention weights. Here testing inputs are queries while training 
inputs are keys. Since both inputs are sorted, we can see that the closer the query-key pair is, the 
higher the attention weight is in the attention pooling. 


d2l.show_heatmaps(np.expand_dims(np.expand_dims(attention_weights, 0), 0),
                  xlabel='Sorted training inputs',
                  ylabel='Sorted testing inputs')


10.2.4 Parametric Attention Pooling 


Nonparametric Nadaraya-Watson kernel regression enjoys the consistency benefit: given enough 
data this model converges to the optimal solution. Nonetheless, we can easily integrate learnable 
parameters into attention pooling. 


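As an example, slightly different from (10.2.6), in the following the distance between the query $x$ 
and the key $x_i$ is multiplied by a learnable parameter $w$: 


$$\begin{aligned}
f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\
&= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}((x - x_i)w)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}((x - x_j)w)^2\right)} y_i \\
&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}((x - x_i)w)^2\right) y_i.
\end{aligned} \tag{10.2.7}$$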

In the rest of the section, we will train this model by learning the parameter of the attention pool- 
ing in (10.2.7). 
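
To get a feeling for what the parameter $w$ does in (10.2.7), the following sketch compares the attention weights for two values of $w$. It is plain Python rather than MXNet, and both the keys and the value `w=3.0` are assumptions chosen for illustration, not a learned result.

```python
import math

def attention_weights(query, keys, w=1.0):
    """Weights of (10.2.7): softmax of -((query - key) * w)^2 / 2."""
    scores = [-0.5 * ((query - k) * w) ** 2 for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

keys = [0.0, 1.0, 2.0, 3.0, 4.0]
broad = attention_weights(2.0, keys, w=1.0)  # reduces to (10.2.6)
sharp = attention_weights(2.0, keys, w=3.0)  # hypothetical larger w
```

With `w=1.0` the weights coincide with the nonparametric case. A larger magnitude of $w$ scales up every squared distance, so the weights concentrate more strongly on the keys nearest to the query.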





Batch Matrix Multiplication 


To more efficiently compute attention for minibatches, we can leverage batch matrix multiplica- 
tion utilities provided by deep learning frameworks. 


Suppose that the first minibatch contains $n$ matrices $\mathbf{X}_1, \ldots, \mathbf{X}_n$ of shape $a \times b$, and the second 
minibatch contains $n$ matrices $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ of shape $b \times c$. Their batch matrix multiplication results 
in $n$ matrices $\mathbf{X}_1\mathbf{Y}_1, \ldots, \mathbf{X}_n\mathbf{Y}_n$ of shape $a \times c$. Therefore, given two tensors of shape ($n$, $a$, $b$) and 
($n$, $b$, $c$), the shape of their batch matrix multiplication output is ($n$, $a$, $c$). 


X = np.ones((2, 1, 4)) 


Y = np.ones((2, 4, 6)) 
npx.batch_dot(X, Y).shape 


(2, 1, 6) 


In the context of attention mechanisms, we can use minibatch matrix multiplication to compute 
weighted averages of values in a minibatch. 
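
Why the extra axes are needed can be seen in a framework-free sketch. The `batch_dot` function below is a hypothetical plain-Python stand-in written for this illustration; it is not the MXNet operator, only a loop that performs the same per-example matrix products on nested lists.

```python
def batch_dot(X, Y):
    """Multiply n matrices of shape (a, b) with n matrices of shape (b, c)."""
    return [[[sum(x_row[k] * y[k][j] for k in range(len(y)))
              for j in range(len(y[0]))]
             for x_row in x]
            for x, y in zip(X, Y)]

weights = [[0.1] * 10, [0.1] * 10]                  # shape (2, 10)
values = [list(range(0, 10)), list(range(10, 20))]  # shape (2, 10)

# Add a row axis to the weights, (2, 1, 10), and a column axis to the
# values, (2, 10, 1), so each example is a (1, 10) x (10, 1) product.
W = [[w] for w in weights]
V = [[[v] for v in vals] for vals in values]
out = batch_dot(W, V)  # shape (2, 1, 1)
```

Each example in the minibatch thus reduces to an inner product between its weight vector and its value vector, which is exactly a weighted average when the weights sum to one.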


weights = np.ones((2, 10)) * 0.1 
values = np.arange(20).reshape((2, 10)) 
npx.batch_dot(np.expand_dims(weights, 1), np.expand_dims(values, -1)) 


array([[[ 4.5]],

       [[14.5]]])


Defining the Model 


Using minibatch matrix multiplication, below we define the parametric version of Nadaraya- 
Watson kernel regression based on the parametric attention pooling in (10.2.7). 


class NWKernelRegression(nn.Block):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.w = self.params.get('w', shape=(1,))

    def forward(self, queries, keys, values):
        # Shape of the output `queries` and `attention_weights`:
        # (no. of queries, no. of key-value pairs)
        queries = queries.repeat(keys.shape[1]).reshape((-1, keys.shape[1]))
        self.attention_weights = npx.softmax(
            -((queries - keys) * self.w.data())**2 / 2)
        # Shape of `values`: (no. of queries, no. of key-value pairs)
        return npx.batch_dot(np.expand_dims(self.attention_weights, 1),
                             np.expand_dims(values, -1)).reshape(-1)


Training 


In the following, we transform the training dataset to keys and values to train the attention model. 
In the parametric attention pooling, any training input takes key-value pairs from all the training 
examples except for itself to predict its output. 
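
The "all examples except itself" construction can be sketched on a tiny dataset. This is plain Python with made-up numbers, and `leave_one_out` is a hypothetical helper name; it is meant to produce the same rows as masking the diagonal of a tiled matrix and reshaping to (n, n - 1).

```python
def leave_one_out(items):
    """Row i holds every element of `items` except the i-th one."""
    return [[v for j, v in enumerate(items) if j != i]
            for i in range(len(items))]

x_toy = [0.5, 1.5, 2.5, 3.5]  # toy training inputs (made up)
y_toy = [1.0, 2.0, 3.0, 4.0]  # toy training outputs (made up)

keys = leave_one_out(x_toy)    # shape (n, n - 1)
values = leave_one_out(y_toy)  # shape (n, n - 1)
```

Excluding each example from its own key-value pairs matters during training: otherwise the model could drive the loss down simply by putting all attention on the query's own pair.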


# Shape of `X_tile`: (`n_train`, `n_train`), where each column contains the
# same training inputs
X_tile = np.tile(x_train, (n_train, 1))
# Shape of `Y_tile`: (`n_train`, `n_train`), where each column contains the
# same training outputs
Y_tile = np.tile(y_train, (n_train, 1))
# Shape of `keys`: (`n_train`, `n_train` - 1)
keys = X_tile[(1 - np.eye(n_train)).astype('bool')].reshape((n_train, -1))
# Shape of `values`: (`n_train`, `n_train` - 1)
values = Y_tile[(1 - np.eye(n_train)).astype('bool')].reshape((n_train, -1))


Using the squared loss and stochastic gradient descent, we train the parametric attention model. 


net = NWKernelRegression()
net.initialize()
loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5})
animator = d2l.Animator(xlabel='epoch', ylabel='loss', xlim=[1, 5])

for epoch in range(5):
    with autograd.record():
        l = loss(net(x_train, keys, values), y_train)
    l.backward()
    trainer.step(1)
    print(f'epoch {epoch + 1}, loss {float(l.sum()):.6f}')
    animator.add(epoch + 1, float(l.sum()))


After training the parametric attention model, we can plot its prediction. Trying to fit the training 
dataset with noise, the predicted line is less smooth than its nonparametric counterpart that was 
plotted earlier. 


# Shape of `keys`: (`n_test`, `n_train`), where each column contains the same
# training inputs (i.e., same keys)
keys = np.tile(x_train, (n_test, 1))
# Shape of `value`: (`n_test`, `n_train`)
values = np.tile(y_train, (n_test, 1))
y_hat = net(x_test, keys, values)
plot_kernel_reg(y_hat)





Compared with nonparametric attention pooling, the region with large attention weights 
becomes sharper in the learnable and parametric setting. 


d2l.show_heatmaps(np.expand_dims(np.expand_dims(net.attention_weights, 0), 0),
                  xlabel='Sorted training inputs',
                  ylabel='Sorted testing inputs')


Summary 


• Nadaraya-Watson kernel regression is an example of machine learning with attention 
  mechanisms. 

• The attention pooling of Nadaraya-Watson kernel regression is a weighted average of the 
  training outputs. From the attention perspective, the attention weight is assigned to a value 
  based on a function of a query and the key that is paired with the value. 

• Attention pooling can be either nonparametric or parametric. 


Exercises 


1. Increase the number of training examples. Can you learn nonparametric Nadaraya-Watson 
kernel regression better? 


2. What is the value of our learned w in the parametric attention pooling experiment? Why 
does it make the weighted region sharper when visualizing the attention weights? 


3. How can we add hyperparameters to nonparametric Nadaraya-Watson kernel regression to 
predict better? 


4. Design another parametric attention pooling for the kernel regression of this section. Train 
this new model and visualize its attention weights. 


Discussions 


10.3 Attention Scoring Functions 


In Section 10.2, we used a Gaussian kernel to model interactions between queries and keys. 
Treating the exponent of the Gaussian kernel in (10.2.6) as an attention scoring function (or scoring 
function for short), the results of this function were essentially fed into a softmax operation. As a 
result, we obtained a probability distribution (attention weights) over values that are paired with 
keys. In the end, the output of the attention pooling is simply a weighted sum of the values based 
on these attention weights. 





12 https://discuss.d2l.ai/t/1598 


At a high level, we can use the above algorithm to instantiate the framework of attention 
mechanisms in Fig. 10.1.3. Denoting an attention scoring function by $a$, Fig. 10.3.1 illustrates how the 
output of attention pooling can be computed as a weighted sum of values. Since attention weights 
are a probability distribution, the weighted sum is essentially a weighted average. 




Fig. 10.3.1: Computing the output of attention pooling as a weighted average of values. 


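Mathematically, suppose that we have a query $\mathbf{q} \in \mathbb{R}^q$ and $m$ key-value pairs 
$(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)$, where any $\mathbf{k}_i \in \mathbb{R}^k$ and any $\mathbf{v}_i \in \mathbb{R}^v$. The attention 
pooling $f$ is instantiated as a weighted sum of the values: 


$$f(\mathbf{q}, (\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)) = \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i \in \mathbb{R}^v, \tag{10.3.1}$$


where the attention weight (scalar) for the query $\mathbf{q}$ and key $\mathbf{k}_i$ is computed by the softmax operation 
of an attention scoring function $a$ that maps two vectors to a scalar: 


$$\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}(a(\mathbf{q}, \mathbf{k}_i)) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_{j=1}^m \exp(a(\mathbf{q}, \mathbf{k}_j))} \in \mathbb{R}. \tag{10.3.2}$$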


As we can see, different choices of the attention scoring function a lead to different behaviors of 
attention pooling. In this section, we introduce two popular scoring functions that we will use to 
develop more sophisticated attention mechanisms later. 
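
Before turning to specific scoring functions, the general recipe of (10.3.1) and (10.3.2) can be sketched with the scoring function left as a parameter. The code is plain Python rather than MXNet; `attention_pooling` and `neg_sq_dist` are hypothetical names, and the negative squared distance is just a placeholder score chosen for this illustration.

```python
import math

def attention_pooling(query, keys, values, score):
    """Weighted sum of `values`, following (10.3.1) and (10.3.2)."""
    scores = [score(query, k) for k in keys]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values))
              for d in range(dim)]
    return output, weights

def neg_sq_dist(q, k):
    """Placeholder scoring function used only for this illustration."""
    return -0.5 * sum((qi - ki) ** 2 for qi, ki in zip(q, k))

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention_pooling(query, keys, values, neg_sq_dist)
```

Swapping in a different `score` function changes the weights but leaves the pooling step untouched, which is the sense in which the scoring function determines the behavior of attention pooling.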





import math
from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()


10.3.1 Masked Softmax Operation 


As we just mentioned, a softmax operation is used to output a probability distribution as attention 
weights. In some cases, not all the values should be fed into attention pooling. For instance, for 
efficient minibatch processing in Section 9.5, some text sequences are padded with special tokens 
that do not carry meaning. To get an attention pooling over only meaningful tokens as values, we 
can specify a valid sequence length (in number of tokens) to filter out those beyond this specified 
range when computing softmax. In this way, we can implement such a masked softmax operation 
in the following masked_softmax function, where any value beyond the valid length is masked as 
zero. 
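
The masking idea itself does not depend on any framework. As a minimal sketch for a single row of scores, assuming plain Python and a hypothetical helper name `masked_softmax_row`, positions beyond the valid length are overwritten with a very large negative number before the softmax:

```python
import math

def masked_softmax_row(scores, valid_len, mask_value=-1e6):
    """Softmax over `scores` that ignores positions >= `valid_len`."""
    masked = [s if i < valid_len else mask_value
              for i, s in enumerate(scores)]
    m = max(masked)
    # exp of a hugely negative number underflows to exactly 0.0
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

row = masked_softmax_row([0.2, 0.7, 0.1, 0.9], valid_len=2)
```

The masked positions end up with zero weight, and the remaining weights still sum to one over the valid positions.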


#@save
def masked_softmax(X, valid_lens):
    """Perform softmax operation by masking elements on the last axis."""
    # `X`: 3D tensor, `valid_lens`: 1D or 2D tensor
    if valid_lens is None:
        return npx.softmax(X)
    else:
        shape = X.shape
        if valid_lens.ndim == 1:
            valid_lens = valid_lens.repeat(shape[1])
        else:
            valid_lens = valid_lens.reshape(-1)
        # On the last axis, replace masked elements with a very large negative
        # value, whose exponentiation outputs 0
        X = npx.sequence_mask(X.reshape(-1, shape[-1]), valid_lens, True,
                              value=-1e6, axis=1)
        return npx.softmax(X).reshape(shape)


To demonstrate how this function works, consider a minibatch of two 2 x 4 matrix examples, 
where the valid lengths for these two examples are two and three, respectively. As a result of the 
masked softmax operation, values beyond the valid lengths are all masked as zero. 


masked_softmax(np.random.uniform(size=(2, 2, 4)), np.array([2, 3])) 


array([[[0.488994  , 0.511006  , 0.        , 0.        ],
        [0.4365484 , 0.56345165, 0.        , 0.        ]],

       [[0.288171  , 0.3519408 , 0.3598882 , 0.        ],
        [0.29034296, 0.25239873, 0.45725837, 0.        ]]])


Similarly, we can also use a two-dimensional tensor to specify valid lengths for every row in each 
matrix example. 


masked_softmax(np.random.uniform(size=(2, 2, 4)),
               np.array([[1, 3], [2, 4]]))


array([[[1.        , 0.        , 0.        , 0.        ],
        [0.35848376, 0.3658879 , 0.27562833, 0.        ]],

       [[0.54370314, 0.45629686, 0.        , 0.        ],
        [0.19598778, 0.25580427, 0.19916739, 0.3490406 ]]])


10.3.2 Additive Attention 


In general, when queries and keys are vectors of different lengths, we can use additive attention 
as the scoring function. Given a query $\mathbf{q} \in \mathbb{R}^q$ and a key $\mathbf{k} \in \mathbb{R}^k$, the additive attention scoring 
function is 


$$a(\mathbf{q}, \mathbf{k}) = \mathbf{w}_v^\top \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}) \in \mathbb{R}, \tag{10.3.3}$$


where the learnable parameters are $\mathbf{W}_q \in \mathbb{R}^{h \times q}$, $\mathbf{W}_k \in \mathbb{R}^{h \times k}$, and $\mathbf{w}_v \in \mathbb{R}^h$. Equivalent to (10.3.3), the 
query and the key are concatenated and fed into an MLP with a single hidden layer whose number 
of hidden units is $h$, a hyperparameter. By using tanh as the activation function and disabling bias 
terms, we implement additive attention in the following. 


#@save
class AdditiveAttention(nn.Block):
    """Additive attention."""
    def __init__(self, num_hiddens, dropout, **kwargs):
        super(AdditiveAttention, self).__init__(**kwargs)
        # Use `flatten=False` to only transform the last axis so that the
        # shapes for the other axes are kept the same
        self.W_k = nn.Dense(num_hiddens, use_bias=False, flatten=False)
        self.W_q = nn.Dense(num_hiddens, use_bias=False, flatten=False)
        self.w_v = nn.Dense(1, use_bias=False, flatten=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        queries, keys = self.W_q(queries), self.W_k(keys)
        # After dimension expansion, shape of `queries`: (`batch_size`, no. of
        # queries, 1, `num_hiddens`) and shape of `keys`: (`batch_size`, 1,
        # no. of key-value pairs, `num_hiddens`). Sum them up with
        # broadcasting
        features = np.expand_dims(queries, axis=2) + np.expand_dims(
            keys, axis=1)
        features = np.tanh(features)
        # There is only one output of `self.w_v`, so we remove the last
        # one-dimensional entry from the shape. Shape of `scores`:
        # (`batch_size`, no. of queries, no. of key-value pairs)
        scores = np.squeeze(self.w_v(features), axis=-1)
        self.attention_weights = masked_softmax(scores, valid_lens)
        # Shape of `values`: (`batch_size`, no. of key-value pairs, value
        # dimension)
        return npx.batch_dot(self.dropout(self.attention_weights), values)


Let us demonstrate the above AdditiveAttention class with a toy example, where shapes (batch 
size, number of steps or sequence length in tokens, feature size) of queries, keys, and values are 
(2, 1, 20), (2, 10, 2), and (2, 10, 4), respectively. The attention pooling output has a shape of (batch 
size, number of steps for queries, feature size for values). 


queries, keys = np.random.normal(0, 1, (2, 1, 20)), np.ones((2, 10, 2))
# The two value matrices in the `values` minibatch are identical
values = np.arange(40).reshape(1, 10, 4).repeat(2, axis=0)
valid_lens = np.array([2, 6])

attention = AdditiveAttention(num_hiddens=8, dropout=0.1)
attention.initialize()
attention(queries, keys, values, valid_lens)


array([[[ 2.      ,  3.      ,  4.      ,  5.      ]],

       [[10.      , 11.      , 12.000001, 13.      ]]])


Although additive attention contains learnable parameters, since every key is the same in this 
example, the attention weights are uniform, determined by the specified valid lengths. 
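
This observation can be reproduced without any learnable parameters. The sketch below is plain Python and rests on one assumption stated in the comments: since all keys are identical, every key receives the same score for a given query, whatever the scoring function is, so an arbitrary constant stands in for that shared score.

```python
import math

def masked_softmax_row(scores, valid_len):
    """Softmax over the first `valid_len` scores; the rest get weight 0."""
    m = max(scores[:valid_len])
    exps = [math.exp(s - m) if i < valid_len else 0.0
            for i, s in enumerate(scores)]
    total = sum(exps)
    return [e / total for e in exps]

# Ten value rows [0, 1, 2, 3], [4, 5, 6, 7], ..., as in the toy example.
values = [[4 * i + j for j in range(4)] for i in range(10)]

# Identical keys give identical scores; 0.37 is an arbitrary placeholder
# for that shared score, not a number taken from the model.
scores = [0.37] * 10

def pool(valid_len):
    """Attention output for one query given the valid length."""
    weights = masked_softmax_row(scores, valid_len)
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(4)]

out_first = pool(2)   # valid length 2
out_second = pool(6)  # valid length 6
```

With uniform weights the output is simply the mean of the first two and the first six value rows, respectively, which agrees with the printed output of the toy example up to rounding.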


d2l.show_heatmaps(attention.attention_weights.reshape((1, 1, 2, 10)),
                  xlabel='Keys', ylabel='Queries')


10.3.3 Scaled Dot-Product Attention 


A more computationally efficient design for the scoring function is simply the dot product. 
However, the dot product operation requires that both the query and the key have the same vector 
length, say $d$. Assume that all the elements of the query and the key are independent random 
variables with zero mean and unit variance. The dot product of both vectors has zero mean and a 
variance of $d$. To ensure that the variance of the dot product still remains one regardless of vector 
length, the scaled dot-product attention scoring function 


$$a(\mathbf{q}, \mathbf{k}) = \mathbf{q}^\top \mathbf{k} / \sqrt{d} \tag{10.3.4}$$


divides the dot product by $\sqrt{d}$. In practice, we often think in minibatches for efficiency, such as 
computing attention for $n$ queries and $m$ key-value pairs, where queries and keys are of length $d$ 
and values are of length $v$. The scaled dot-product attention of queries $\mathbf{Q} \in \mathbb{R}^{n \times d}$, keys $\mathbf{K} \in \mathbb{R}^{m \times d}$, 
and values $\mathbf{V} \in \mathbb{R}^{m \times v}$ is 


$$\mathrm{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d}}\right) \mathbf{V} \in \mathbb{R}^{n \times v}. \tag{10.3.5}$$


In the following implementation of the scaled dot product attention, we use dropout for model 
regularization. 


#@save
class DotProductAttention(nn.Block):
    """Scaled dot product attention."""
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # Shape of `queries`: (`batch_size`, no. of queries, `d`)
    # Shape of `keys`: (`batch_size`, no. of key-value pairs, `d`)
    # Shape of `values`: (`batch_size`, no. of key-value pairs, value
    # dimension)
    # Shape of `valid_lens`: (`batch_size`,) or (`batch_size`, no. of queries)
    def forward(self, queries, keys, values, valid_lens=None):
        d = queries.shape[-1]
        # Set `transpose_b=True` to swap the last two dimensions of `keys`
        scores = npx.batch_dot(queries, keys, transpose_b=True) / math.sqrt(d)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return npx.batch_dot(self.dropout(self.attention_weights), values)


To demonstrate the above DotProductAttention class, we use the same keys, values, and valid 
lengths from the earlier toy example for additive attention. For the dot product operation, we 
make the feature size of queries the same as that of keys. 


queries = np.random.normal(0, 1, (2, 1, 2))
attention = DotProductAttention(dropout=0.5)
attention.initialize()
attention(queries, keys, values, valid_lens)


array([[[ 2.      ,  3.      ,  4.      ,  5.      ]],

       [[10.      , 11.      , 12.000001, 13.      ]]])


Same as in the additive attention demonstration, since keys contains the same element that cannot 
be differentiated by any query, uniform attention weights are obtained. 


d2l.show_heatmaps(attention.attention_weights.reshape((1, 1, 2, 10)),
                  xlabel='Keys', ylabel='Queries')


Summary 


• We can compute the output of attention pooling as a weighted average of values, where 
  different choices of the attention scoring function lead to different behaviors of attention 
  pooling. 

• When queries and keys are vectors of different lengths, we can use the additive attention 
  scoring function. When they are the same, the scaled dot-product attention scoring function 
  is more computationally efficient. 


Exercises 


1. Modify keys in the toy example and visualize attention weights. Do additive attention and 
scaled dot-product attention still output the same attention weights? Why or why not? 


2. Using matrix multiplications only, can you design a new scoring function for queries and 
keys with different vector lengths? 


3. When queries and keys have the same vector length, is vector summation a better design 
than dot product for the scoring function? Why or why not? 


Discussions 


10.4 Bahdanau Attention 


We studied the machine translation problem in Section 9.7, where we designed an encoder- 
decoder architecture based on two RNNs for sequence to sequence learning. Specifically, the 
RNN encoder transforms a variable-length sequence into a fixed-shape context variable, then the 
RNN decoder generates the output (target) sequence token by token based on the generated to- 
kens and the context variable. However, even though not all the input (source) tokens are useful 
for decoding a certain token, the same context variable that encodes the entire input sequence is 
still used at each decoding step. 


In a separate but related challenge of handwriting generation for a given text sequence, Graves 
designed a differentiable attention model to align text characters with the much longer pen trace, 
where the alignment moves only in one direction (Graves, 2013). Inspired by the idea of learning to 
align, Bahdanau et al. proposed a differentiable attention model without the severe unidirectional 
alignment limitation (Bahdanau et al., 2014). When predicting a token, if not all the input tokens 
are relevant, the model aligns (or attends) only to parts of the input sequence that are relevant to 
the current prediction. This is achieved by treating the context variable as an output of attention 
pooling. 





12 https://discuss.d2l.ai/t/346 


10.4.1 Model 


When describing Bahdanau attention for the RNN encoder-decoder below, we will follow the same 
notation in Section 9.7. The new attention-based model is the same as that in Section 9.7 except 
that the context variable $\mathbf{c}$ in (9.7.3) is replaced by $\mathbf{c}_{t'}$ at any decoding time step $t'$. Suppose that 
there are $T$ tokens in the input sequence; then the context variable at the decoding time step $t'$ is the 
output of attention pooling: 


$$\mathbf{c}_{t'} = \sum_{t=1}^T \alpha(\mathbf{s}_{t'-1}, \mathbf{h}_t) \mathbf{h}_t, \tag{10.4.1}$$


where the decoder hidden state $\mathbf{s}_{t'-1}$ at time step $t' - 1$ is the query, and the encoder hidden states 
$\mathbf{h}_t$ are both the keys and values, and the attention weight $\alpha$ is computed as in (10.3.2) using the 
additive attention scoring function defined by (10.3.3). 


Slightly different from the vanilla RNN encoder-decoder architecture in Fig. 9.7.2, the same archi- 
tecture with Bahdanau attention is depicted in Fig. 10.4.1. 








Fig. 10.4.1: Layers in an RNN encoder-decoder model with Bahdanau attention. 
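
To connect (10.4.1) with the additive scoring function of (10.3.3), the following sketch computes one context variable in plain Python. All names (`matvec`, `additive_score`, `context`) and all numbers are hypothetical: the parameters are hand-picked for illustration, whereas in the model they are learned, and the hidden states would come from the RNN encoder and decoder.

```python
import math

def matvec(W, x):
    """Multiply a matrix given as a list of rows by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def additive_score(q, k, W_q, W_k, w_v):
    """a(q, k) = w_v^T tanh(W_q q + W_k k), as in (10.3.3)."""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W_q, q), matvec(W_k, k))]
    return sum(w * h for w, h in zip(w_v, hidden))

def context(s_prev, enc_states, W_q, W_k, w_v):
    """Context variable of (10.4.1): attention pooling over encoder states."""
    scores = [additive_score(s_prev, h, W_q, W_k, w_v) for h in enc_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(enc_states[0])
    c = [sum(a * h[d] for a, h in zip(alphas, enc_states))
         for d in range(dim)]
    return c, alphas

# Hand-picked parameters for illustration only (h = 3 hidden units).
W_q = [[0.5, -0.2], [0.1, 0.3], [0.0, 0.4]]
W_k = [[0.3, 0.1], [-0.4, 0.2], [0.2, 0.2]]
w_v = [1.0, -1.0, 0.5]

enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 encoder states
s_prev = [0.2, -0.1]                               # previous decoder state
c, alphas = context(s_prev, enc_states, W_q, W_k, w_v)
```

Because the query is the decoder state from the previous step, the weights and hence the context variable are recomputed at every decoding step, unlike the single fixed context of the plain encoder-decoder.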


from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import rnn, nn
npx.set_np()


10.4.2 Defining the Decoder with Attention 


To implement the RNN encoder-decoder with Bahdanau attention, we only need to redefine the 
decoder. To visualize the learned attention weights more conveniently, the following 
AttentionDecoder class defines the base interface for decoders with attention mechanisms. 


#@save
class AttentionDecoder(d2l.Decoder):
    """The base attention-based decoder interface."""
    def __init__(self, **kwargs):
        super(AttentionDecoder, self).__init__(**kwargs)

    @property
    def attention_weights(self):
        raise NotImplementedError


Now let us implement the RNN decoder with Bahdanau attention in the following 
Seq2SeqAttentionDecoder class. The state of the decoder is initialized with i) the encoder 
final-layer hidden states at all the time steps (as keys and values of the attention); ii) the encoder 
all-layer hidden state at the final time step (to initialize the hidden state of the decoder); and iii) 
the encoder valid length (to exclude the padding tokens in attention pooling). At each decoding 
time step, the decoder final-layer hidden state at the previous time step is used as the query of 
the attention. As a result, both the attention output and the input embedding are concatenated 
as the input of the RNN decoder. 


class Seq2SeqAttentionDecoder(AttentionDecoder):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqAttentionDecoder, self).__init__(**kwargs)
        self.attention = d2l.AdditiveAttention(num_hiddens, dropout)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = rnn.GRU(num_hiddens, num_layers, dropout=dropout)
        self.dense = nn.Dense(vocab_size, flatten=False)

    def init_state(self, enc_outputs, enc_valid_lens, *args):
        # Shape of `outputs`: (`num_steps`, `batch_size`, `num_hiddens`).
        # Shape of `hidden_state[0]`: (`num_layers`, `batch_size`,
        # `num_hiddens`)
        outputs, hidden_state = enc_outputs
        return (outputs.swapaxes(0, 1), hidden_state, enc_valid_lens)

    def forward(self, X, state):
        # Shape of `enc_outputs`: (`batch_size`, `num_steps`, `num_hiddens`).
        # Shape of `hidden_state[0]`: (`num_layers`, `batch_size`,
        # `num_hiddens`)
        enc_outputs, hidden_state, enc_valid_lens = state
        # Shape of the output `X`: (`num_steps`, `batch_size`, `embed_size`)
        X = self.embedding(X).swapaxes(0, 1)
        outputs, self._attention_weights = [], []
        for x in X:
            # Shape of `query`: (`batch_size`, 1, `num_hiddens`)
            query = np.expand_dims(hidden_state[0][-1], axis=1)
            # Shape of `context`: (`batch_size`, 1, `num_hiddens`)
            context = self.attention(
                query, enc_outputs, enc_outputs, enc_valid_lens)
            # Concatenate on the feature dimension
            x = np.concatenate((context, np.expand_dims(x, axis=1)), axis=-1)
            # Reshape `x` as (1, `batch_size`, `embed_size` + `num_hiddens`)
            out, hidden_state = self.rnn(x.swapaxes(0, 1), hidden_state)
            outputs.append(out)
            self._attention_weights.append(self.attention.attention_weights)
        # After fully-connected layer transformation, shape of `outputs`:
        # (`num_steps`, `batch_size`, `vocab_size`)
        outputs = self.dense(np.concatenate(outputs, axis=0))
        return outputs.swapaxes(0, 1), [enc_outputs, hidden_state,
                                        enc_valid_lens]

    @property
    def attention_weights(self):
        return self._attention_weights


In the following, we test the implemented decoder with Bahdanau attention using a minibatch of 
4 sequence inputs of 7 time steps. 


encoder = d2l.Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16,
                             num_layers=2)
encoder.initialize()
decoder = Seq2SeqAttentionDecoder(vocab_size=10, embed_size=8, num_hiddens=16,
                                  num_layers=2)
decoder.initialize()
X = np.zeros((4, 7))  # (`batch_size`, `num_steps`)
state = decoder.init_state(encoder(X), None)
output, state = decoder(X, state)
output.shape, len(state), state[0].shape, len(state[1]), state[1][0].shape


((4, 7, 10), 3, (4, 7, 16), 1, (2, 4, 16)) 


10.4.3 Training 


Similar to Section 9.7.4, here we specify hyperparameters, instantiate an encoder and a decoder
with Bahdanau attention, and train this model for machine translation. Due to the newly added
attention mechanism, this training is much slower than that in Section 9.7.4 without attention
mechanisms.


embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1
batch_size, num_steps = 64, 10
lr, num_epochs, device = 0.005, 250, d2l.try_gpu()

train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)
encoder = d2l.Seq2SeqEncoder(
    len(src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqAttentionDecoder(
    len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
net = d2l.EncoderDecoder(encoder, decoder)
d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)


loss 0.024, 2725.2 tokens/sec on gpu(0)

After the model is trained, we use it to translate a few English sentences into French and compute
their BLEU scores.


engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
for eng, fra in zip(engs, fras):
    translation, dec_attention_weight_seq = d2l.predict_seq2seq(
        net, eng, src_vocab, tgt_vocab, num_steps, device, True)
    print(f'{eng} => {translation}, ',
          f'bleu {d2l.bleu(translation, fra, k=2):.3f}')


go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
he's calm . => il est riche ., bleu 0.658
i'm home . => je suis chez moi ., bleu 1.000


attention_weights = np.concatenate(
    [step[0][0][0] for step in dec_attention_weight_seq], 0).reshape(
    (1, 1, -1, num_steps))


By visualizing the attention weights when translating the last English sentence, we can see that 
each query assigns non-uniform weights over key-value pairs. It shows that at each decoding step, 
different parts of the input sequences are selectively aggregated in the attention pooling. 


# Plus one to include the end-of-sequence token
d2l.show_heatmaps(
    attention_weights[:, :, :, :len(engs[-1].split()) + 1],
    xlabel='Key positions', ylabel='Query positions')

Summary


• When predicting a token, if not all the input tokens are relevant, the RNN encoder-decoder
with Bahdanau attention selectively aggregates different parts of the input sequence. This
is achieved by treating the context variable as an output of additive attention pooling.


• In the RNN encoder-decoder, Bahdanau attention treats the decoder hidden state at the previous
time step as the query, and the encoder hidden states at all the time steps as both the
keys and values.


Exercises 


1. Replace GRU with LSTM in the experiment. 


2. Modify the experiment to replace the additive attention scoring function with the scaled 
dot-product. How does it influence the training efficiency? 


Discussions: https://discuss.d2l.ai/t/347


10.5 Multi-Head Attention 


In practice, given the same set of queries, keys, and values we may want our model to combine 
knowledge from different behaviors of the same attention mechanism, such as capturing depen- 
dencies of various ranges (e.g., shorter-range vs. longer-range) within a sequence. Thus, it may 
be beneficial to allow our attention mechanism to jointly use different representation subspaces 
of queries, keys, and values. 


To this end, instead of performing a single attention pooling, queries, keys, and values can be 
transformed with h independently learned linear projections. Then these h projected queries, 
keys, and values are fed into attention pooling in parallel. In the end, h attention pooling outputs 
are concatenated and transformed with another learned linear projection to produce the final out- 
put. This design is called multi-head attention, where each of the h attention pooling outputs is a 
head (Vaswani et al., 2017). Using fully-connected layers to perform learnable linear transforma- 
tions, Fig. 10.5.1 describes multi-head attention. 





12 https://discuss.d2l.ai/t/347
[Figure: inputs labeled Queries, Keys, and Values]


Fig. 10.5.1: Multi-head attention, where multiple heads are concatenated then linearly trans- 
formed. 


10.5.1 Model 


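Before providing the implementation of multi-head attention, let us formalize this model
mathematically. Given a query $\mathbf{q} \in \mathbb{R}^{d_q}$, a key $\mathbf{k} \in \mathbb{R}^{d_k}$, and a value
$\mathbf{v} \in \mathbb{R}^{d_v}$, each attention head $\mathbf{h}_i$ ($i = 1, \ldots, h$) is computed as

$$\mathbf{h}_i = f(\mathbf{W}_i^{(q)} \mathbf{q}, \mathbf{W}_i^{(k)} \mathbf{k}, \mathbf{W}_i^{(v)} \mathbf{v}) \in \mathbb{R}^{p_v}, \tag{10.5.1}$$

where learnable parameters $\mathbf{W}_i^{(q)} \in \mathbb{R}^{p_q \times d_q}$, $\mathbf{W}_i^{(k)} \in \mathbb{R}^{p_k \times d_k}$ and
$\mathbf{W}_i^{(v)} \in \mathbb{R}^{p_v \times d_v}$, and $f$ is attention pooling, such as additive attention and scaled
dot-product attention in Section 10.3. The multi-head attention output is another linear
transformation via learnable parameters $\mathbf{W}_o \in \mathbb{R}^{p_o \times h p_v}$ of the concatenation of $h$ heads:

$$\mathbf{W}_o \begin{bmatrix} \mathbf{h}_1 \\ \vdots \\ \mathbf{h}_h \end{bmatrix} \in \mathbb{R}^{p_o}. \tag{10.5.2}$$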

Based on this design, each head may attend to different parts of the input. More sophisticated 
functions than the simple weighted average can be expressed. 


from d2l import mxnet as d2l
import math
from mxnet import autograd, np, npx
from mxnet.gluon import nn
npx.set_np()


10.5.2 Implementation 


In our implementation, we choose the scaled dot-product attention for each head of the multi-head
attention. To avoid significant growth of computational cost and parameterization cost, we
set $p_q = p_k = p_v = p_o / h$. Note that $h$ heads can be computed in parallel if we set the number of
outputs of linear transformations for the query, key, and value to $p_q h = p_k h = p_v h = p_o$. In the
following implementation, $p_o$ is specified via the argument num_hiddens.

#@save
class MultiHeadAttention(nn.Block):
    def __init__(self, num_hiddens, num_heads, dropout, use_bias=False,
                 **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.num_heads = num_heads
        self.attention = d2l.DotProductAttention(dropout)
        self.W_q = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False)
        self.W_k = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False)
        self.W_v = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False)
        self.W_o = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False)

    def forward(self, queries, keys, values, valid_lens):
        # Shape of `queries`, `keys`, or `values`:
        # (`batch_size`, no. of queries or key-value pairs, `num_hiddens`)
        # Shape of `valid_lens`:
        # (`batch_size`,) or (`batch_size`, no. of queries)
        # After transposing, shape of output `queries`, `keys`, or `values`:
        # (`batch_size` * `num_heads`, no. of queries or key-value pairs,
        # `num_hiddens` / `num_heads`)
        queries = transpose_qkv(self.W_q(queries), self.num_heads)
        keys = transpose_qkv(self.W_k(keys), self.num_heads)
        values = transpose_qkv(self.W_v(values), self.num_heads)

        if valid_lens is not None:
            # On axis 0, copy the first item (scalar or vector) for
            # `num_heads` times, then copy the next item, and so on
            valid_lens = valid_lens.repeat(self.num_heads, axis=0)

        # Shape of `output`: (`batch_size` * `num_heads`, no. of queries,
        # `num_hiddens` / `num_heads`)
        output = self.attention(queries, keys, values, valid_lens)

        # Shape of `output_concat`:
        # (`batch_size`, no. of queries, `num_hiddens`)
        output_concat = transpose_output(output, self.num_heads)
        return self.W_o(output_concat)
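
The choice $p_q = p_k = p_v = p_o / h$ can be checked with simple arithmetic. The sketch below counts weight-matrix entries (biases ignored) following the shapes in (10.5.1) and (10.5.2); the function name `multi_head_param_count` and the dimensions are illustrative assumptions, not part of the book's code.

```python
def multi_head_param_count(d_q, d_k, d_v, p_o, h):
    """Count weights (no biases) when p_q = p_k = p_v = p_o / h."""
    assert p_o % h == 0, "p_o must be divisible by the number of heads"
    p = p_o // h
    # Each head owns W_i^(q) (p x d_q), W_i^(k) (p x d_k), W_i^(v) (p x d_v)
    per_head = p * d_q + p * d_k + p * d_v
    # W_o has shape p_o x (h * p_v)
    output_proj = p_o * (h * p)
    return h * per_head + output_proj

# Hypothetical dimensions: all inputs are 100-dimensional and p_o = 100
for h in (1, 2, 5, 10):
    print(h, multi_head_param_count(100, 100, 100, 100, h))
```

Under these assumptions the total stays constant as the number of heads grows, which is why the implementation can use four dense layers with num_hiddens outputs regardless of num_heads.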


To allow for parallel computation of multiple heads, the above MultiHeadAttention class uses two 
transposition functions as defined below. Specifically, the transpose_output function reverses the 
operation of the transpose_qkv function. 


#@save
def transpose_qkv(X, num_heads):
    # Shape of input `X`:
    # (`batch_size`, no. of queries or key-value pairs, `num_hiddens`).
    # Shape of output `X`:
    # (`batch_size`, no. of queries or key-value pairs, `num_heads`,
    # `num_hiddens` / `num_heads`)
    X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)

    # Shape of output `X`:
    # (`batch_size`, `num_heads`, no. of queries or key-value pairs,
    # `num_hiddens` / `num_heads`)
    X = X.transpose(0, 2, 1, 3)

    # Shape of `output`:
    # (`batch_size` * `num_heads`, no. of queries or key-value pairs,
    # `num_hiddens` / `num_heads`)
    return X.reshape(-1, X.shape[2], X.shape[3])


#@save
def transpose_output(X, num_heads):
    """Reverse the operation of `transpose_qkv`"""
    X = X.reshape(-1, num_heads, X.shape[1], X.shape[2])
    X = X.transpose(0, 2, 1, 3)
    return X.reshape(X.shape[0], X.shape[1], -1)
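
The effect of the two transposition functions can be traced on nested Python lists. The following is a rough equivalent written without any tensor library; `split_heads` and `merge_heads` are hypothetical names for illustration, and the toy values simply encode (example, position, feature) so that the layout is easy to read.

```python
def split_heads(X, num_heads):
    """(batch, n, hidden) -> (batch * num_heads, n, hidden / num_heads)."""
    out = []
    for seq in X:                        # one example of the minibatch
        width = len(seq[0]) // num_heads
        for h in range(num_heads):       # heads of one example stay adjacent
            out.append([tok[h * width:(h + 1) * width] for tok in seq])
    return out

def merge_heads(X, num_heads):
    """Inverse of `split_heads`: concatenate the heads of each example."""
    out = []
    for b in range(0, len(X), num_heads):
        heads = X[b:b + num_heads]
        n = len(heads[0])
        out.append([[v for head in heads for v in head[t]]
                    for t in range(n)])
    return out

# Toy tensor: value = 100 * example + 10 * position + feature
X = [[[100 * b + 10 * t + f for f in range(4)] for t in range(3)]
     for b in range(2)]
Y = split_heads(X, num_heads=2)
print(len(Y), len(Y[0]), len(Y[0][0]))   # batch * heads, positions, width
print(merge_heads(Y, num_heads=2) == X)  # the round trip restores X
```

Since the heads of one example end up next to each other along the first axis, repeating each entry of valid_lens num_heads times along axis 0 keeps the valid lengths aligned with the split tensors.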


Let us test our implemented MultiHeadAttention class using a toy example where keys and values
are the same. As a result, the shape of the multi-head attention output is (batch_size, num_queries,
num_hiddens).


num_hiddens, num_heads = 100, 5
attention = MultiHeadAttention(num_hiddens, num_heads, 0.5)
attention.initialize()

batch_size, num_queries, num_kvpairs, valid_lens = 2, 4, 6, np.array([3, 2])
X = np.ones((batch_size, num_queries, num_hiddens))
Y = np.ones((batch_size, num_kvpairs, num_hiddens))
attention(X, Y, Y, valid_lens).shape


(2, 4, 100) 


Summary 

• Multi-head attention combines knowledge of the same attention pooling via different
representation subspaces of queries, keys, and values.


• To compute multiple heads of multi-head attention in parallel, proper tensor manipulation
is needed.


Exercises 


1. Visualize attention weights of multiple heads in this experiment. 


2. Suppose that we have a trained model based on multi-head attention and we want to prune 
least important attention heads to increase the prediction speed. How can we design exper- 
iments to measure the importance of an attention head? 


Discussions: https://discuss.d2l.ai/t/1634





13 https://discuss.d2l.ai/t/1634

10.6 Self-Attention and Positional Encoding


In deep learning, we often use CNNs or RNNs to encode a sequence. Now with attention
mechanisms, imagine that we feed a sequence of tokens into attention pooling so that the same set
of tokens acts as queries, keys, and values. Specifically, each query attends to all the key-value
pairs and generates one attention output. Since the queries, keys, and values come from the same 
place, this performs self-attention (Lin et al., 2017b; Vaswani et al., 2017), which is also called intra- 
attention (Cheng et al., 2016; Parikh et al., 2016; Paulus et al., 2017). In this section, we will discuss 
sequence encoding using self-attention, including using additional information for the sequence 
order. 


from d2l import mxnet as d2l
import math
from mxnet import autograd, np, npx
from mxnet.gluon import nn
npx.set_np()


10.6.1 Self-Attention 


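Given a sequence of input tokens $\mathbf{x}_1, \ldots, \mathbf{x}_n$, where any $\mathbf{x}_i \in \mathbb{R}^d$ ($1 \le i \le n$), its self-attention
outputs a sequence of the same length $\mathbf{y}_1, \ldots, \mathbf{y}_n$, where

$$\mathbf{y}_i = f(\mathbf{x}_i, (\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)) \in \mathbb{R}^d \tag{10.6.1}$$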


according to the definition of attention pooling f in (10.2.4). Using multi-head attention, the fol- 
lowing code snippet computes the self-attention of a tensor with shape (batch size, number of 
time steps or sequence length in tokens, d). The output tensor has the same shape. 


num_hiddens, num_heads = 100, 5
attention = d2l.MultiHeadAttention(num_hiddens, num_heads, 0.5)
attention.initialize()

batch_size, num_queries, valid_lens = 2, 4, np.array([3, 2])
X = np.ones((batch_size, num_queries, num_hiddens))
attention(X, X, X, valid_lens).shape


(2, 4, 100) 
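
For intuition about (10.6.1), self-attention can also be written out for a single sequence in plain Python. This is a simplified sketch under stated assumptions: it uses scaled dot-product attention as $f$, a single head, no learned projections, and no masking; the function name `self_attention` and the toy input are made up for illustration.

```python
import math

def self_attention(X):
    """Scaled dot-product self-attention with queries = keys = values = X."""
    d = len(X[0])
    Y = []
    for q in X:                                     # each token acts as a query
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)                             # for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # y_i is a weighted average of all tokens (the values)
        Y.append([sum(w * v[j] for w, v in zip(weights, X))
                  for j in range(d)])
    return Y

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 tokens, d = 2
Y = self_attention(X)
print(len(Y), len(Y[0]))                   # same length and dimension as X
```

Every output position mixes information from all input positions in one step, which is the property compared against CNNs and RNNs in this section.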


10.6.2 Comparing CNNs, RNNs, and Self-Attention 


Let us compare architectures for mapping a sequence of n tokens to another sequence of equal 
length, where each input or output token is represented by a d-dimensional vector. Specifically, 
we will consider CNNs, RNNs, and self-attention. We will compare their computational complex- 
ity, sequential operations, and maximum path lengths. Note that sequential operations prevent 
parallel computation, while a shorter path between any combination of sequence positions makes 
it easier to learn long-range dependencies within the sequence (Hochreiter et al., 2001). 

Fig. 10.6.1: Comparing CNN (padding tokens are omitted), RNN, and self-attention architectures.


Consider a convolutional layer whose kernel size is $k$. We will provide more details about sequence
processing using CNNs in later chapters. For now, we only need to know that since the sequence
length is $n$, the numbers of input and output channels are both $d$, the computational complexity of
the convolutional layer is $O(knd^2)$. As Fig. 10.6.1 shows, CNNs are hierarchical so there are $O(1)$
sequential operations and the maximum path length is $O(n/k)$. For example, $\mathbf{x}_1$ and $\mathbf{x}_5$ are within
the receptive field of a two-layer CNN with kernel size 3 in Fig. 10.6.1.


When updating the hidden state of RNNs, multiplication of the $d \times d$ weight matrix and the
$d$-dimensional hidden state has a computational complexity of $O(d^2)$. Since the sequence length is
$n$, the computational complexity of the recurrent layer is $O(nd^2)$. According to Fig. 10.6.1, there
are $O(n)$ sequential operations that cannot be parallelized and the maximum path length is also
$O(n)$.


In self-attention, the queries, keys, and values are all $n \times d$ matrices. Consider the scaled
dot-product attention in (10.3.5), where an $n \times d$ matrix is multiplied by a $d \times n$ matrix, then the
output $n \times n$ matrix is multiplied by an $n \times d$ matrix. As a result, the self-attention has an $O(n^2 d)$
computational complexity. As we can see in Fig. 10.6.1, each token is directly connected to any
other token via self-attention. Therefore, computation can be parallel with $O(1)$ sequential
operations and the maximum path length is also $O(1)$.


All in all, both CNNs and self-attention enjoy parallel computation and self-attention has the
shortest maximum path length. However, the quadratic computational complexity with respect to the
sequence length makes self-attention prohibitively slow for very long sequences.
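
The comparison above can be tabulated with a few lines of arithmetic. The sketch below plugs numbers into the leading-order expressions from this section; constants are dropped, a value of 1 stands for $O(1)$, and the function name `layer_costs` and the chosen $n$, $d$, and $k$ are illustrative assumptions only.

```python
def layer_costs(n, d, k=3):
    """Leading-order costs for mapping n tokens of dimension d to n tokens."""
    return {
        'cnn': {'ops': k * n * d ** 2, 'sequential': 1,
                'max_path': -(-n // k)},             # ceil(n / k)
        'rnn': {'ops': n * d ** 2, 'sequential': n, 'max_path': n},
        'self_attention': {'ops': n ** 2 * d, 'sequential': 1,
                           'max_path': 1},
    }

short = layer_costs(n=10, d=512)       # short sequence, n < d
long = layer_costs(n=10000, d=512)     # very long sequence, n > d
for name in ('cnn', 'rnn', 'self_attention'):
    print(name, short[name]['ops'], long[name]['ops'])
```

With these toy numbers, self-attention needs fewer operations than the recurrent layer when $n < d$, while for $n > d$ the quadratic term in $n$ dominates.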

10.6.3 Positional Encoding


Unlike RNNs that recurrently process tokens of a sequence one by one, self-attention ditches se- 
quential operations in favor of parallel computation. To use the sequence order information, we 
can inject absolute or relative positional information by adding positional encoding to the input rep- 
resentations. Positional encodings can be either learned or fixed. In the following, we describe a 
fixed positional encoding based on sine and cosine functions (Vaswani et al., 2017). 


Suppose that the input representation $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains the $d$-dimensional embeddings for $n$
tokens of a sequence. The positional encoding outputs $\mathbf{X} + \mathbf{P}$ using a positional embedding matrix
$\mathbf{P} \in \mathbb{R}^{n \times d}$ of the same shape, whose element on the $i^\mathrm{th}$ row and the $(2j)^\mathrm{th}$ or the $(2j+1)^\mathrm{th}$ column
is

$$\begin{aligned}
p_{i, 2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right), \\
p_{i, 2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right).
\end{aligned} \tag{10.6.2}$$

At first glance, this trigonometric-function design looks weird. Before explanations of this design,
let us first implement it in the following PositionalEncoding class.


#@save
class PositionalEncoding(nn.Block):
    def __init__(self, num_hiddens, dropout, max_len=1000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(dropout)
        # Create a long enough `P`
        self.P = np.zeros((1, max_len, num_hiddens))
        X = np.arange(max_len).reshape(-1, 1) / np.power(
            10000, np.arange(0, num_hiddens, 2) / num_hiddens)
        self.P[:, :, 0::2] = np.sin(X)
        self.P[:, :, 1::2] = np.cos(X)

    def forward(self, X):
        X = X + self.P[:, :X.shape[1], :].as_in_ctx(X.ctx)
        return self.dropout(X)


In the positional embedding matrix $\mathbf{P}$, rows correspond to positions within a sequence and
columns represent different positional encoding dimensions. In the example below, we can see
that the 6th and the 7th columns of the positional embedding matrix have a higher frequency than
the 8th and the 9th columns. The offset between the 6th and the 7th (same for the 8th and the 9th)
columns is due to the alternation of sine and cosine functions.


encoding_dim, num_steps = 32, 60
pos_encoding = PositionalEncoding(encoding_dim, 0)
pos_encoding.initialize()
X = pos_encoding(np.zeros((1, num_steps, encoding_dim)))
P = pos_encoding.P[:, :X.shape[1], :]
d2l.plot(np.arange(num_steps), P[0, :, 6:10].T, xlabel='Row (position)',
         figsize=(6, 2.5), legend=["Col %d" % d for d in np.arange(6, 10)])
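
The frequency difference between column pairs follows directly from (10.6.2): columns $2j$ and $2j+1$ share the angular frequency $1/10000^{2j/d}$, so their wavelength in positions is $2\pi \cdot 10000^{2j/d}$. The pure-Python sketch below recomputes the matrix and these wavelengths; `positional_encoding` and `wavelength` are hypothetical helper names, and an even $d$ is assumed.

```python
import math

def positional_encoding(num_positions, d):
    """Build P following (10.6.2); assumes d is even."""
    P = [[0.0] * d for _ in range(num_positions)]
    for i in range(num_positions):
        for j in range(d // 2):
            angle = i / 10000 ** (2 * j / d)
            P[i][2 * j] = math.sin(angle)
            P[i][2 * j + 1] = math.cos(angle)
    return P

def wavelength(col, d):
    """Number of positions per full period for a given column."""
    j = col // 2
    return 2 * math.pi * 10000 ** (2 * j / d)

P = positional_encoding(60, 32)
for col in range(6, 10):
    print(col, round(wavelength(col, 32), 2))
```

Columns 6 and 7 complete a period in fewer positions than columns 8 and 9, which matches the higher frequency visible in the plot.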

Absolute Positional Information


To see how the monotonically decreased frequency along the encoding dimension relates to ab- 
solute positional information, let us print out the binary representations of 0,1,...,7. As we can 
see, the lowest bit, the second-lowest bit, and the third-lowest bit alternate on every number, ev- 
ery two numbers, and every four numbers, respectively. 


for i in range(8):
    print(f'{i} in binary is {i:>03b}')


0 in binary is 000
1 in binary is 001
2 in binary is 010
3 in binary is 011
4 in binary is 100
5 in binary is 101
6 in binary is 110
7 in binary is 111


In binary representations, a higher bit has a lower frequency than a lower bit. Similarly, as demon- 
strated in the heat map below, the positional encoding decreases frequencies along the encoding 
dimension by using trigonometric functions. Since the outputs are float numbers, such continu- 
ous representations are more space-efficient than binary representations. 


P = np.expand_dims(np.expand_dims(P[0, :, :], 0), 0)
d2l.show_heatmaps(P, xlabel='Column (encoding dimension)',
                  ylabel='Row (position)', figsize=(3.5, 4), cmap='Blues')

Relative Positional Information


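Besides capturing absolute positional information, the above positional encoding also allows a
model to easily learn to attend by relative positions. This is because for any fixed position offset
$\delta$, the positional encoding at position $i + \delta$ can be represented by a linear projection of that at
position $i$.

This projection can be explained mathematically. Denoting $\omega_j = 1/10000^{2j/d}$, any pair of
$(p_{i, 2j}, p_{i, 2j+1})$ in (10.6.2) can be linearly projected to $(p_{i+\delta, 2j}, p_{i+\delta, 2j+1})$ for any fixed offset $\delta$:

$$\begin{aligned}
&\begin{bmatrix} \cos(\delta \omega_j) & \sin(\delta \omega_j) \\ -\sin(\delta \omega_j) & \cos(\delta \omega_j) \end{bmatrix}
\begin{bmatrix} p_{i, 2j} \\ p_{i, 2j+1} \end{bmatrix} \\
=& \begin{bmatrix} \cos(\delta \omega_j) \sin(i \omega_j) + \sin(\delta \omega_j) \cos(i \omega_j) \\ -\sin(\delta \omega_j) \sin(i \omega_j) + \cos(\delta \omega_j) \cos(i \omega_j) \end{bmatrix} \\
=& \begin{bmatrix} \sin\left((i + \delta) \omega_j\right) \\ \cos\left((i + \delta) \omega_j\right) \end{bmatrix} \\
=& \begin{bmatrix} p_{i+\delta, 2j} \\ p_{i+\delta, 2j+1} \end{bmatrix},
\end{aligned} \tag{10.6.3}$$

where the $2 \times 2$ projection matrix does not depend on any position index $i$.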
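
The identity in (10.6.3) is easy to confirm numerically. The sketch below applies the $2 \times 2$ projection to the pair at position $i$ and compares the result with the encoding at position $i + \delta$; the helper name `shift_pair` and the particular values of $d$, $j$, and $\delta$ are arbitrary choices for illustration.

```python
import math

def shift_pair(p_sin, p_cos, delta, omega):
    """Apply the 2 x 2 projection of (10.6.3) to (p_{i,2j}, p_{i,2j+1})."""
    c, s = math.cos(delta * omega), math.sin(delta * omega)
    return c * p_sin + s * p_cos, -s * p_sin + c * p_cos

d, j, delta = 32, 3, 5                      # arbitrary illustrative values
omega = 1 / 10000 ** (2 * j / d)
max_error = 0.0
for i in range(50):
    projected = shift_pair(math.sin(i * omega), math.cos(i * omega),
                           delta, omega)
    expected = (math.sin((i + delta) * omega), math.cos((i + delta) * omega))
    max_error = max(max_error,
                    abs(projected[0] - expected[0]),
                    abs(projected[1] - expected[1]))
print(max_error)                            # tiny floating-point residue only
```

The same projection (built only from $\delta$ and $\omega_j$) works for every position $i$ in the loop, which is what lets a model express relative offsets as a fixed linear map.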

Summary


• In self-attention, the queries, keys, and values all come from the same place.


• Both CNNs and self-attention enjoy parallel computation and self-attention has the shortest
maximum path length. However, the quadratic computational complexity with respect to
the sequence length makes self-attention prohibitively slow for very long sequences.


• To use the sequence order information, we can inject absolute or relative positional
information by adding positional encoding to the input representations.


Exercises 


1. Suppose that we design a deep architecture to represent a sequence by stacking self- 
attention layers with positional encoding. What could be issues? 


2. Can you design a learnable positional encoding method? 


Discussions: https://discuss.d2l.ai/t/1651


10.7 Transformer 


We have compared CNNs, RNNs, and self-attention in Section 10.6.2. Notably, self-attention
enjoys both parallel computation and the shortest maximum path length. It is therefore
appealing to design deep architectures by using self-attention. Unlike earlier self-attention
models that still rely on RNNs for input representations (Cheng et al., 2016; Lin et al., 2017b; Paulus
et al., 2017), the Transformer model is solely based on attention mechanisms without any con- 
volutional or recurrent layer (Vaswani et al., 2017). Though originally proposed for sequence to 
sequence learning on text data, Transformers have been pervasive in a wide range of modern deep 
learning applications, such as in areas of language, vision, speech, and reinforcement learning. 


10.7.1 Model 


As an instance of the encoder-decoder architecture, the overall architecture of the Transformer 
is presented in Fig. 10.7.1. As we can see, the Transformer is composed of an encoder and a de- 
coder. Different from Bahdanau attention for sequence to sequence learning in Fig. 10.4.1, the 
input (source) and output (target) sequence embeddings are added with positional encoding be- 
fore being fed into the encoder and the decoder that stack modules based on self-attention. 





126 https://discuss.d2l.ai/t/1651

Fig. 10.7.1: The Transformer architecture.


Now we provide an overview of the Transformer architecture in Fig. 10.7.1. On a high level, the
Transformer encoder is a stack of multiple identical layers, where each layer has two sublayers
(each denoted as sublayer). The first is a multi-head self-attention pooling and the second is a
positionwise feed-forward network. Specifically, in the encoder self-attention, queries, keys, and
values are all from the outputs of the previous encoder layer. Inspired by the ResNet design
in Section 7.6, a residual connection is employed around both sublayers. In the Transformer, for
any input $\mathbf{x} \in \mathbb{R}^d$ at any position of the sequence, we require that $\mathrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$ so that the
residual connection $\mathbf{x} + \mathrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$ is feasible. This addition from the residual connection
is immediately followed by layer normalization (Ba et al., 2016). As a result, the Transformer
encoder outputs a $d$-dimensional vector representation for each position of the input sequence.


The Transformer decoder is also a stack of multiple identical layers with residual connections and
layer normalizations. Besides the two sublayers described in the encoder, the decoder inserts
a third sublayer, known as the encoder-decoder attention, between these two. In the
encoder-decoder attention, queries are from the outputs of the previous decoder layer, and the keys and
values are from the Transformer encoder outputs. In the decoder self-attention, queries, keys, and
values are all from the outputs of the previous decoder layer. However, each position in the
decoder is allowed to attend only to positions in the decoder up to and including that position. This masked
attention preserves the auto-regressive property, ensuring that the prediction only depends on
those output tokens that have been generated.
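
The masking idea can be sketched with per-query valid lengths: the query at (0-based) position $t$ gets a valid length of $t + 1$, so the softmax only covers positions $0, \ldots, t$. The code below is a pure-Python illustration under that assumption; `masked_softmax` here is a simplified stand-in written for this example, not the book's helper, and the score values are made up.

```python
import math

def masked_softmax(scores, valid_len):
    """Softmax over the first `valid_len` scores; the rest get weight 0."""
    kept = scores[:valid_len]
    m = max(kept)
    exps = [math.exp(s - m) for s in kept]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(scores) - valid_len)

def causal_attention_weights(score_matrix):
    """Row t may attend only to positions 0..t (valid length t + 1)."""
    return [masked_softmax(row, t + 1) for t, row in enumerate(score_matrix)]

# Hypothetical raw attention scores for 3 decoder positions
scores = [[0.1, 0.9, 0.3],
          [0.5, 0.2, 0.8],
          [0.4, 0.4, 0.4]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The resulting weight matrix is lower triangular: the first position can only attend to itself, and no position places weight on tokens that come after it.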


We have already described and implemented multi-head attention based on scaled dot-products 
in Section 10.5 and positional encoding in Section 10.6.3. In the following, we will implement the 
rest of the Transformer model. 


from d2l import mxnet as d2l
import math
from mxnet import autograd, np, npx
from mxnet.gluon import nn
import pandas as pd
npx.set_np()


10.7.2 Positionwise Feed-Forward Networks 


The positionwise feed-forward network transforms the representation at all the sequence posi- 
tions using the same MLP. This is why we call it positionwise. In the implementation below, the 
input X with shape (batch size, number of time steps or sequence length in tokens, number of 
hidden units or feature dimension) will be transformed by a two-layer MLP into an output tensor 
of shape (batch size, number of time steps, ffn_num_outputs). 


#@save
class PositionWiseFFN(nn.Block):
    def __init__(self, ffn_num_hiddens, ffn_num_outputs, **kwargs):
        super(PositionWiseFFN, self).__init__(**kwargs)
        self.dense1 = nn.Dense(ffn_num_hiddens, flatten=False,
                               activation='relu')
        self.dense2 = nn.Dense(ffn_num_outputs, flatten=False)

    def forward(self, X):
        return self.dense2(self.dense1(X))


The following example shows that the innermost dimension of a tensor changes to the number 
of outputs in the positionwise feed-forward network. Since the same MLP transforms at all the 
positions, when the inputs at all these positions are the same, their outputs are also identical. 


ffn = PositionWiseFFN(4, 8) 
ffn.initialize() 
ffn(np.ones((2, 3, 4)))[0] 


array([[ 0.00239431,  0.00927085, -0.00021069, -0.00923989, -0.0082903 ,
        -0.00162741,  0.00659031,  0.00023905],
       [ 0.00239431,  0.00927085, -0.00021069, -0.00923989, -0.0082903 ,
        -0.00162741,  0.00659031,  0.00023905],
       [ 0.00239431,  0.00927085, -0.00021069, -0.00923989, -0.0082903 ,
        -0.00162741,  0.00659031,  0.00023905]])
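
Why identical inputs give identical rows can be seen by writing the positionwise network as a loop over positions. The sketch below is a plain-Python illustration, not the Gluon implementation: the helper names `dense` and `positionwise_ffn` and the fixed weights are made up, whereas the real layers are randomly initialized.

```python
def dense(x, W, b, relu=False):
    """One fully-connected layer applied to a single vector."""
    out = [sum(w * xi for w, xi in zip(row, x)) + bi
           for row, bi in zip(W, b)]
    return [max(0.0, v) for v in out] if relu else out

def positionwise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer MLP to every position of every sequence."""
    return [[dense(dense(tok, W1, b1, relu=True), W2, b2) for tok in seq]
            for seq in X]

# Hypothetical fixed weights: 4 inputs -> 2 hidden units -> 3 outputs
W1 = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]]
b2 = [0.0, 0.0, -0.1]

X = [[[1.0] * 4 for _ in range(3)] for _ in range(2)]   # shape (2, 3, 4)
Y = positionwise_ffn(X, W1, b1, W2, b2)                 # shape (2, 3, 3)
print(Y[0])
```

Positions never interact inside this network; only the innermost (feature) dimension changes, here from 4 to 3.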


10.7.3 Residual Connection and Layer Normalization 


Now let us focus on the “add & norm” component in Fig. 10.7.1. As we described at the beginning 
of this section, this is a residual connection immediately followed by layer normalization. Both 
are key to effective deep architectures. 


In Section 7.5, we explained how batch normalization recenters and rescales across the exam- 
ples within a minibatch. Layer normalization is the same as batch normalization except that the 
former normalizes across the feature dimension. Despite its pervasive applications in computer 
vision, batch normalization is usually empirically less effective than layer normalization in natu- 
ral language processing tasks, whose inputs are often variable-length sequences. 


The following code snippet compares the normalization across different dimensions by layer nor- 
malization and batch normalization. 


ln = nn.LayerNorm()
ln.initialize()
bn = nn.BatchNorm()
bn.initialize()
X = np.array([[1, 2], [2, 3]])
# Compute mean and variance from `X` in the training mode
with autograd.record():
    print('layer norm:', ln(X), '\nbatch norm:', bn(X))


layer norm: [[-0.99998  0.99998]
 [-0.99998  0.99998]]
batch norm: [[-0.99998 -0.99998]
 [ 0.99998  0.99998]]
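
The two results differ only in the axis over which the mean and variance are computed. The following pure-Python sketch reproduces the numbers above under a few assumptions: a stabilizing constant of 1e-5 in the denominator, and scale and shift parameters left at 1 and 0. The helper names are hypothetical.

```python
import math

def normalize(values, eps=1e-5):
    """Recenter and rescale a list of numbers (eps is an assumed constant)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(X):
    """Normalize each row, i.e., across the feature dimension."""
    return [normalize(row) for row in X]

def batch_norm(X):
    """Normalize each column, i.e., across the examples of the minibatch."""
    cols = [normalize(list(col)) for col in zip(*X)]
    return [list(row) for row in zip(*cols)]

X = [[1.0, 2.0], [2.0, 3.0]]
print('layer norm:', [[round(v, 5) for v in row] for row in layer_norm(X)])
print('batch norm:', [[round(v, 5) for v in row] for row in batch_norm(X)])
```

Layer normalization treats each example independently of the rest of the minibatch, so its statistics do not change with the batch composition or with how many padded positions other sequences contain.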


Now we can implement the AddNorm class using a residual connection followed by layer normal- 
ization. Dropout is also applied for regularization. 


#@save
class AddNorm(nn.Block):
    def __init__(self, dropout, **kwargs):
        super(AddNorm, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm()

    def forward(self, X, Y):
        return self.ln(self.dropout(Y) + X)


The residual connection requires that the two inputs are of the same shape so that the output 
tensor also has the same shape after the addition operation. 


add_norm = AddNorm(0.5)
add_norm.initialize()
add_norm(np.ones((2, 3, 4)), np.ones((2, 3, 4))).shape


(2, 3, 4)


10.7.4 Encoder 


With all the essential components to assemble the Transformer encoder, let us start by imple- 
menting a single layer within the encoder. The following EncoderBlock class contains two sublay- 
ers: multi-head self-attention and positionwise feed-forward networks, where a residual connec- 
tion followed by layer normalization is employed around both sublayers. 


#@save
class EncoderBlock(nn.Block):
    def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, dropout,
                 use_bias=False, **kwargs):
        super(EncoderBlock, self).__init__(**kwargs)
        self.attention = d2l.MultiHeadAttention(
            num_hiddens, num_heads, dropout, use_bias)
        self.addnorm1 = AddNorm(dropout)
        self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens)
        self.addnorm2 = AddNorm(dropout)

    def forward(self, X, valid_lens):
        Y = self.addnorm1(X, self.attention(X, X, X, valid_lens))
        return self.addnorm2(Y, self.ffn(Y))


As we can see, any layer in the Transformer encoder does not change the shape of its input. 


X = np.ones((2, 100, 24))
valid_lens = np.array([3, 2])
encoder_blk = EncoderBlock(24, 48, 8, 0.5)
encoder_blk.initialize()
encoder_blk(X, valid_lens).shape


(2, 100, 24) 


In the following Transformer encoder implementation, we stack num_layers instances of the 
above EncoderBlock classes. Since we use the fixed positional encoding whose values are always 
between -1 and 1, we multiply values of the learnable input embeddings by the square root of 
the embedding dimension to rescale before summing up the input embedding and the positional 
encoding. 


#@save
class TransformerEncoder(d2l.Encoder):
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens,
                 num_heads, num_layers, dropout, use_bias=False, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for _ in range(num_layers):
            self.blks.add(
                EncoderBlock(num_hiddens, ffn_num_hiddens, num_heads, dropout,
                             use_bias))

    def forward(self, X, valid_lens, *args):
        # Since positional encoding values are between -1 and 1, the embedding
        # values are multiplied by the square root of the embedding dimension
        # to rescale before they are summed up
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self.attention_weights = [None] * len(self.blks)
        for i, blk in enumerate(self.blks):
            X = blk(X, valid_lens)
            self.attention_weights[
                i] = blk.attention.attention.attention_weights
        return X


Below we specify hyperparameters to create a two-layer Transformer encoder. The shape of the 
Transformer encoder output is (batch size, number of time steps, num_hiddens). 


encoder = TransformerEncoder(200, 24, 48, 8, 2, 0.5)
encoder.initialize()
encoder(np.ones((2, 100)), valid_lens).shape


(2, 100, 24) 


10.7.5 Decoder 


As shown in Fig. 10.7.1, the Transformer decoder is composed of multiple identical layers. Each 
layer is implemented in the following DecoderBlock class, which contains three sublayers: de- 
coder self-attention, encoder-decoder attention, and positionwise feed-forward networks. These 
sublayers employ a residual connection around them followed by layer normalization. 


As we described earlier in this section, in the masked multi-head decoder self-attention (the first 
sublayer), queries, keys, and values all come from the outputs of the previous decoder layer. When 
training sequence-to-sequence models, tokens at all the positions (time steps) of the output se- 
quence are known. However, during prediction the output sequence is generated token by token; 
thus, at any decoder time step only the generated tokens can be used in the decoder self-attention. 
To preserve auto-regression in the decoder, its masked self-attention specifies dec_valid_lens so 
that any query only attends to all positions in the decoder up to the query position. 
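
As a minimal sketch of this masking rule, the valid length for the query at position t (counting
from 0) is t + 1, so the pattern of allowed positions is lower triangular. The code is plain Python
rather than the MXNet code used in this section, and the helper names causal_valid_lens and
causal_mask are illustrative, not part of d2l.

```python
def causal_valid_lens(batch_size, num_steps):
    """Return one row [1, 2, ..., num_steps] per sequence in the batch."""
    return [list(range(1, num_steps + 1)) for _ in range(batch_size)]


def causal_mask(num_steps):
    """mask[q][k] is True when query position q may attend to key position k."""
    valid_lens = causal_valid_lens(1, num_steps)[0]
    return [[k < valid_lens[q] for k in range(num_steps)]
            for q in range(num_steps)]


mask = causal_mask(4)
for row in mask:
    # 'x' marks an allowed key position, '.' a masked one
    print(' '.join('x' if allowed else '.' for allowed in row))
```

Each printed row corresponds to one query position and shows that it can see itself and earlier
positions only.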


class DecoderBlock(nn.Block):
    # The `i`-th block in the decoder
    def __init__(self, num_hiddens, ffn_num_hiddens, num_heads,
                 dropout, i, **kwargs):
        super(DecoderBlock, self).__init__(**kwargs)
        self.i = i
        self.attention1 = d2l.MultiHeadAttention(num_hiddens, num_heads,
                                                 dropout)
        self.addnorm1 = AddNorm(dropout)
        self.attention2 = d2l.MultiHeadAttention(num_hiddens, num_heads,
                                                 dropout)
        self.addnorm2 = AddNorm(dropout)
        self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens)
        self.addnorm3 = AddNorm(dropout)

    def forward(self, X, state):
        enc_outputs, enc_valid_lens = state[0], state[1]
        # During training, all the tokens of any output sequence are processed
        # at the same time, so `state[2][self.i]` is `None` as initialized.
        # When decoding any output sequence token by token during prediction,
        # `state[2][self.i]` contains representations of the decoded output at
        # the `i`-th block up to the current time step
        if state[2][self.i] is None:
            key_values = X
        else:
            key_values = np.concatenate((state[2][self.i], X), axis=1)
        state[2][self.i] = key_values

        if autograd.is_training():
            batch_size, num_steps, _ = X.shape
            # Shape of `dec_valid_lens`: (`batch_size`, `num_steps`), where
            # every row is [1, 2, ..., `num_steps`]
            dec_valid_lens = np.tile(np.arange(1, num_steps + 1, ctx=X.ctx),
                                     (batch_size, 1))
        else:
            dec_valid_lens = None

        # Self-attention
        X2 = self.attention1(X, key_values, key_values, dec_valid_lens)
        Y = self.addnorm1(X, X2)
        # Encoder-decoder attention. Shape of `enc_outputs`:
        # (`batch_size`, `num_steps`, `num_hiddens`)
        Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)
        Z = self.addnorm2(Y, Y2)
        return self.addnorm3(Z, self.ffn(Z)), state


To facilitate scaled dot-product operations in the encoder-decoder attention and addition opera- 
tions in the residual connections, the feature dimension (num_hiddens) of the decoder is the same 
as that of the encoder. 


decoder_blk = DecoderBlock(24, 48, 8, 0.5, 0)
decoder_blk.initialize()
X = np.ones((2, 100, 24))
state = [encoder_blk(X, valid_lens), valid_lens, [None]]
decoder_blk(X, state)[0].shape


(2, 100, 24) 


Now we construct the entire Transformer decoder composed of num_layers instances of
DecoderBlock. In the end, a fully-connected layer computes the prediction for all the vocab_size
possible output tokens. Both the decoder self-attention weights and the encoder-decoder
attention weights are stored for later visualization.


class TransformerDecoder(d2l.AttentionDecoder):
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens,
                 num_heads, num_layers, dropout, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add(
                DecoderBlock(num_hiddens, ffn_num_hiddens, num_heads,
                             dropout, i))
        self.dense = nn.Dense(vocab_size, flatten=False)

    def init_state(self, enc_outputs, enc_valid_lens, *args):
        return [enc_outputs, enc_valid_lens, [None] * self.num_layers]

    def forward(self, X, state):
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self._attention_weights = [[None] * len(self.blks) for _ in range(2)]
        for i, blk in enumerate(self.blks):
            X, state = blk(X, state)
            # Decoder self-attention weights
            self._attention_weights[0][
                i] = blk.attention1.attention.attention_weights
            # Encoder-decoder attention weights
            self._attention_weights[1][
                i] = blk.attention2.attention.attention_weights
        return self.dense(X), state

    @property
    def attention_weights(self):
        return self._attention_weights


10.7.6 Training 


Let us instantiate an encoder-decoder model by following the Transformer architecture. Here 
we specify that both the Transformer encoder and the Transformer decoder have 2 layers using 4- 
head attention. Similar to Section 9.7.4, we train the Transformer model for sequence to sequence 
learning on the English-French machine translation dataset. 


num_hiddens, num_layers, dropout, batch_size, num_steps = 32, 2, 0.1, 64, 10
lr, num_epochs, device = 0.005, 200, d2l.try_gpu()
ffn_num_hiddens, num_heads = 64, 4

train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)

encoder = TransformerEncoder(
    len(src_vocab), num_hiddens, ffn_num_hiddens, num_heads, num_layers,
    dropout)
decoder = TransformerDecoder(
    len(tgt_vocab), num_hiddens, ffn_num_hiddens, num_heads, num_layers,
    dropout)
net = d2l.EncoderDecoder(encoder, decoder)
d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)


loss 0.031, 2271.8 tokens/sec on gpu(0) 


[Figure: training loss versus epoch (y-axis: loss; x-axis: epoch, up to 200)]


After training, we use the Transformer model to translate a few English sentences into French and 
compute their BLEU scores. 


engs = ['go .', 'i lost .', 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
for eng, fra in zip(engs, fras):
    translation, dec_attention_weight_seq = d2l.predict_seq2seq(
        net, eng, src_vocab, tgt_vocab, num_steps, device, True)
    print(f'{eng} => {translation}, ',
          f'bleu {d2l.bleu(translation, fra, k=2):.3f}')


go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 0.687
he's calm . => il est calme c'est la partie !, bleu 0.497
i'm home . => je suis chez moi ., bleu 1.000


Let us visualize the Transformer attention weights when translating the last English sentence into 
French. The shape of the encoder self-attention weights is (number of encoder layers, number of 
attention heads, num_steps or number of queries, num_steps or number of key-value pairs). 


enc_attention_weights = np.concatenate(net.encoder.attention_weights, 
0).reshape((num_layers, num_heads, -1, 
num_steps)) 
enc_attention_weights.shape 


(2, 4, 10, 10) 


In the encoder self-attention, both queries and keys come from the same input sequence. Since
padding tokens do not carry meaning, with the specified valid length of the input sequence, no query
attends to positions of padding tokens. In the following, two layers of multi-head attention weights
are presented row by row. Each head independently attends based on a separate representation
subspace of queries, keys, and values.


d2l.show_heatmaps(
    enc_attention_weights, xlabel='Key positions', ylabel='Query positions',
    titles=['Head %d' % i for i in range(1, 5)], figsize=(7, 3.5))


[Figure: encoder self-attention heatmaps, one row per layer with panels Head 1 to Head 4; x-axis: Key positions; y-axis: Query positions]


To visualize both the decoder self-attention weights and the encoder-decoder attention weights, 
we need more data manipulations. For example, we fill the masked attention weights with zero. 
Note that the decoder self-attention weights and the encoder-decoder attention weights both have 
the same queries: the beginning-of-sequence token followed by the output tokens. 


dec_attention_weights_2d = [
    np.array(head[0]).tolist() for step in dec_attention_weight_seq
    for attn in step for blk in attn for head in blk
]
dec_attention_weights_filled = np.array(
    pd.DataFrame(dec_attention_weights_2d).fillna(0.0).values)
dec_attention_weights = dec_attention_weights_filled.reshape(
    (-1, 2, num_layers, num_heads, num_steps))
dec_self_attention_weights, dec_inter_attention_weights = \
    dec_attention_weights.transpose(1, 2, 3, 0, 4)
dec_self_attention_weights.shape, dec_inter_attention_weights.shape


((2, 4, 6, 10), (2, 4, 6, 10)) 


Due to the auto-regressive property of the decoder self-attention, no query attends to key-value 
pairs after the query position. 


# Plus one to include the beginning-of-sequence token
d2l.show_heatmaps(
    dec_self_attention_weights[:, :, :, :len(translation.split()) + 1],
    xlabel='Key positions', ylabel='Query positions',
    titles=['Head %d' % i for i in range(1, 5)], figsize=(7, 3.5))


Similar to the case in the encoder self-attention, via the specified valid length of the input
sequence, no query from the output sequence attends to those padding tokens from the input
sequence.


d2l.show_heatmaps(
    dec_inter_attention_weights, xlabel='Key positions',
    ylabel='Query positions', titles=['Head %d' % i for i in range(1, 5)],
    figsize=(7, 3.5))


Although the Transformer architecture was originally proposed for sequence-to-sequence learning,
as we will discover later in the book, either the Transformer encoder or the Transformer
decoder is often individually used for different deep learning tasks.


Summary 


The Transformer is an instance of the encoder-decoder architecture, though either the en- 
coder or the decoder can be used individually in practice. 


In the Transformer, multi-head self-attention is used for representing the input sequence 
and the output sequence, though the decoder has to preserve the auto-regressive property 
via a masked version. 


Both the residual connections and the layer normalization in the Transformer are important 
for training a very deep model. 


The positionwise feed-forward network in the Transformer model transforms the represen- 
tation at all the sequence positions using the same MLP. 


Exercises 


1. Train a deeper Transformer in the experiments. How does it affect the training speed and
   the translation performance?
2. Is it a good idea to replace scaled dot-product attention with additive attention in the
   Transformer? Why?
3. For language modeling, should we use the Transformer encoder, decoder, or both? How would
   you design this method?
4. What challenges can Transformers face if input sequences are very long? Why?
5. How can we improve the computational and memory efficiency of Transformers? Hint: you may
   refer to the survey paper by Tay et al. (Tay et al., 2020).
6. How can we design Transformer-based models for image classification tasks without using
   CNNs? Hint: you may refer to the Vision Transformer (Dosovitskiy et al., 2021).


Discussions: https://discuss.d2l.ai/t/348


11 Optimization Algorithms


If you read the book in sequence up to this point, you have already used a number of optimization
algorithms to train deep learning models. They were the tools that allowed us to continue updating
model parameters and to minimize the value of the loss function, as evaluated on the training set.
Indeed, anyone content with treating optimization as a black box device to minimize objective
functions in a simple setting might well be satisfied with the knowledge that there exists an
array of incantations of such a procedure (with names such as “SGD” and “Adam”).


To do well, however, some deeper knowledge is required. Optimization algorithms are important 
for deep learning. On one hand, training a complex deep learning model can take hours, days, or 
even weeks. The performance of the optimization algorithm directly affects the model's training 
efficiency. On the other hand, understanding the principles of different optimization algorithms 
and the role of their hyperparameters will enable us to tune the hyperparameters in a targeted 
manner to improve the performance of deep learning models. 


In this chapter, we explore common deep learning optimization algorithms in depth. Almost all 
optimization problems arising in deep learning are nonconvex. Nonetheless, the design and anal- 
ysis of algorithms in the context of convex problems have proven to be very instructive. It is for 
that reason that this chapter includes a primer on convex optimization and the proof for a very 
simple stochastic gradient descent algorithm on a convex objective function. 


11.1 Optimization and Deep Learning 


In this section, we will discuss the relationship between optimization and deep learning as well
as the challenges of using optimization in deep learning. For a deep learning problem, we will
usually define a loss function first. Once we have the loss function, we can use an optimization
algorithm in an attempt to minimize the loss. In optimization, a loss function is often referred to as
the objective function of the optimization problem. By tradition and convention most optimization
algorithms are concerned with minimization. If we ever need to maximize an objective there
is a simple solution: just flip the sign on the objective.


11.1.1 Optimization and Estimation


Although optimization provides a way to minimize the loss function for deep learning, in essence, 
the goals of optimization and deep learning are fundamentally different. The former is primarily 
concerned with minimizing an objective whereas the latter is concerned with finding a suitable 
model, given a finite amount of data. In Section 4.4, we discussed the difference between these 
two goals in detail. For instance, training error and generalization error generally differ: since the 
objective function of the optimization algorithm is usually a loss function based on the training 
dataset, the goal of optimization is to reduce the training error. However, the goal of statistical 
inference (and thus of deep learning) is to reduce the generalization error. To accomplish the 
latter we need to pay attention to overfitting in addition to using the optimization algorithm to 
reduce the training error. We begin by importing a few libraries for this chapter. 


%matplotlib inline
from d2l import mxnet as d2l
from mpl_toolkits import mplot3d
from mxnet import np, npx

npx.set_np()


Next we define two functions, the expected function f and the empirical function g, to illustrate
this issue. Here g is less smooth than f since we have only a finite amount of data.


def f(x): return x * np.cos(np.pi * x) 
def g(x): return f(x) + 0.2 * np.cos(5 * np.pi * x) 


The graph below illustrates that the minimum of the training error may be at a different location 
than the minimum of the expected error (or of the test error). 


def annotate(text, xy, xytext):  #@save
    d2l.plt.gca().annotate(text, xy=xy, xytext=xytext,
                           arrowprops=dict(arrowstyle='->'))


x = np.arange(0.5, 1.5, 0.01)
d2l.set_figsize((4.5, 2.5))
d2l.plot(x, [f(x), g(x)], 'x', 'risk')
annotate('empirical risk', (1.0, -1.2), (0.5, -1.1))
annotate('expected risk', (1.1, -1.05), (0.95, -0.5))


[Figure: expected risk f and empirical risk g plotted against x; y-axis: risk]
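

The mismatch between the two minimizers can also be checked numerically. The sketch below is
written in plain Python with the standard math module rather than the MXNet np interface used in
this chapter. It evaluates both functions on the same grid as the plot and reports where each one
is smallest.

```python
import math


def f(x):
    """Expected risk from the text."""
    return x * math.cos(math.pi * x)


def g(x):
    """Empirical risk from the text: f plus a higher-frequency wiggle."""
    return f(x) + 0.2 * math.cos(5 * math.pi * x)


# Same grid as the plot: 0.5, 0.51, ..., 1.49
xs = [0.5 + 0.01 * i for i in range(100)]
x_expected = min(xs, key=f)
x_empirical = min(xs, key=g)
print(f'argmin of expected risk f:  x = {x_expected:.2f}')
print(f'argmin of empirical risk g: x = {x_empirical:.2f}')
```

The two grid minimizers differ, which is the point of the plot: driving the empirical risk to its
minimum does not place us at the minimum of the expected risk.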


11.1.2 Optimization Challenges in Deep Learning


In this chapter, we are going to focus specifically on the performance of the optimization algorithm
in minimizing the objective function, rather than a model's generalization error. In Section 3.1
we distinguished between analytical solutions and numerical solutions in optimization problems.
In deep learning, most objective functions are complicated and do not have analytical solutions.
Instead, we must use numerical optimization algorithms. The optimization algorithms below all
fall into this category.


There are many challenges in deep learning optimization. Some of the most vexing ones are local 
minima, saddle points and vanishing gradients. Let us have a look at a few of them. 


Local Minima 


For the objective function f(x), if the value of f(x) at x is smaller than the values of f(x) at any
other points in the vicinity of x, then f(x) could be a local minimum. If the value of f(x) at x is
the minimum of the objective function over the entire domain, then f(x) is the global minimum.


For example, given the function

$$f(x) = x \cdot \cos(\pi x) \text{ for } -1.0 \leq x \leq 2.0, \tag{11.1.1}$$

we can approximate the local minimum and global minimum of this function.


x = np.arange(-1.0, 2.0, 0.01)
d2l.plot(x, [f(x)], 'x', 'f(x)')
annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0))
annotate('global minimum', (1.1, -0.95), (0.6, 0.8))


[Figure: plot of f(x) against x with the local minimum and the global minimum annotated]





The objective function of deep learning models usually has many local optima. When the
numerical solution of an optimization problem is near a local optimum, the numerical solution
obtained by the final iteration may only minimize the objective function locally, rather than
globally, as the gradient of the objective function approaches or becomes zero there. Only some
degree of noise might knock the parameter out of the local minimum. In fact, this is one of the
beneficial properties of minibatch stochastic gradient descent, where the natural variation of
gradients over minibatches is able to dislodge the parameters from local minima.
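

As a small illustration of getting trapped, the following sketch runs plain gradient descent,
without any noise, on f(x) = x · cos(πx) from two starting points. It uses only the standard math
module. The helper gradient_descent, the learning rate, and the starting points are illustrative
choices rather than part of the book's code.

```python
import math


def f(x):
    return x * math.cos(math.pi * x)


def f_grad(x):
    """Derivative of f: cos(pi x) - pi x sin(pi x)."""
    return math.cos(math.pi * x) - math.pi * x * math.sin(math.pi * x)


def gradient_descent(x, lr=0.05, num_steps=200):
    """Plain (noise-free) gradient descent on f."""
    for _ in range(num_steps):
        x -= lr * f_grad(x)
    return x


x_local = gradient_descent(-0.5)  # starts in the basin of the local minimum
x_global = gradient_descent(1.5)  # starts in the basin of the global minimum
print(f'from -0.5: x = {x_local:.3f}, f(x) = {f(x_local):.3f}')
print(f'from  1.5: x = {x_global:.3f}, f(x) = {f(x_global):.3f}')
```

Both runs end at a point with vanishing gradient, but only the second one reaches the global
minimum. Without noise, the first run has no way to leave its basin.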


Saddle Points


Besides local minima, saddle points are another reason for gradients to vanish. A saddle point
is any location where all gradients of a function vanish but which is neither a global nor a local
minimum. Consider the function $f(x) = x^3$. Its first and second derivative vanish for $x = 0$.
Optimization might stall at this point, even though it is not a minimum.


x = np.arange(-2.0, 2.0, 0.01)
d2l.plot(x, [x**3], 'x', 'f(x)')
annotate('saddle point', (0, -0.2), (-0.52, -5.0))


[Figure: plot of f(x) = x^3 against x with the saddle point at x = 0 annotated]


Saddle points in higher dimensions are even more insidious, as the example below shows.
Consider the function $f(x, y) = x^2 - y^2$. It has its saddle point at $(0, 0)$. This is a maximum with
respect to $y$ and a minimum with respect to $x$. Moreover, it looks like a saddle, which is where this
mathematical property got its name.


x, y = np.meshgrid(
    np.linspace(-1.0, 1.0, 101), np.linspace(-1.0, 1.0, 101))
z = x**2 - y**2

ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.plot([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y');


Saddle point: https://en.wikipedia.org/wiki/Saddle_point


We assume that the input of a function is a k-dimensional vector and its output is a scalar, so its
Hessian matrix will have k eigenvalues (refer to Section 18.1). The solution of the function could
be a local minimum, a local maximum, or a saddle point at a position where the function gradient
is zero:


* When the eigenvalues of the function's Hessian matrix at the zero-gradient position are all
  positive, we have a local minimum for the function.
* When the eigenvalues of the function's Hessian matrix at the zero-gradient position are all
  negative, we have a local maximum for the function.
* When the eigenvalues of the function's Hessian matrix at the zero-gradient position include
  both negative and positive values, we have a saddle point for the function.
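

The three cases above can be turned into a small classifier. The sketch below is limited to
two-dimensional inputs so that the eigenvalues of the symmetric Hessian have a closed form, and the
helper names eigenvalues_2x2 and classify are hypothetical. A fourth outcome is added for Hessians
with a zero eigenvalue, where this second-order test cannot decide (the situation of f(x) = x^3
at x = 0).

```python
import math


def eigenvalues_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius


def classify(eigenvalues, tol=1e-12):
    """Classify a zero-gradient point from the eigenvalues of the Hessian."""
    if all(ev > tol for ev in eigenvalues):
        return 'local minimum'
    if all(ev < -tol for ev in eigenvalues):
        return 'local maximum'
    has_positive = any(ev > tol for ev in eigenvalues)
    has_negative = any(ev < -tol for ev in eigenvalues)
    if has_positive and has_negative:
        return 'saddle point'
    return 'inconclusive'  # some eigenvalue is (numerically) zero


# Hessians at (0, 0) of x^2 - y^2, x^2 + y^2, and -x^2 - y^2
for name, hessian in [('x^2 - y^2', (2, 0, -2)),
                      ('x^2 + y^2', (2, 0, 2)),
                      ('-x^2 - y^2', (-2, 0, -2))]:
    print(name, '->', classify(eigenvalues_2x2(*hessian)))
```

For larger Hessians the same classification applies, with the eigenvalues obtained from a
numerical eigenvalue routine instead of the closed form used here.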


For high-dimensional problems the likelihood that at least some of the eigenvalues are negative
is quite high. This makes saddle points more likely than local minima. We will discuss some
exceptions to this situation in the next section when introducing convexity. In short, convex
functions are those where the eigenvalues of the Hessian are never negative. Sadly, though, most deep
learning problems do not fall into this category. Nonetheless, convexity is a great tool to study
optimization algorithms.


Vanishing Gradients 


Probably the most insidious problem to encounter is the vanishing gradient. For instance, assume
that we want to minimize the function $f(x) = \tanh(x)$ and we happen to get started at $x = 4$.
As we can see, the gradient of $f$ is close to nil. More specifically, $f'(x) = 1 - \tanh^2(x)$ and thus
$f'(4) = 0.0013$. Consequently, optimization will get stuck for a long time before we make progress.
This turns out to be one of the reasons that training deep learning models was quite tricky prior
to the introduction of the ReLU activation function.
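

The numbers are easy to reproduce. The following sketch, using only the standard math module,
evaluates the derivative at x = 4 and then takes 100 gradient descent steps to show how little the
iterate moves. The learning rate 0.1 and the number of steps are arbitrary illustrative choices.

```python
import math


def f_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1 - math.tanh(x) ** 2


grad_at_4 = f_grad(4.0)
print(f"f'(4) = {grad_at_4:.4f}")

# Gradient descent on f(x) = tanh(x), starting in the flat region
x, lr = 4.0, 0.1
for _ in range(100):
    x -= lr * f_grad(x)
print(f'x after 100 steps: {x:.3f}')
```

Because every update is proportional to a gradient of roughly one thousandth, the iterate barely
leaves its starting point.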


x = np.arange(-2.0, 5.0, 0.01)
d2l.plot(x, [np.tanh(x)], 'x', 'f(x)')
annotate('vanishing gradient', (4, 1), (2, 0.0))


[Figure: plot of f(x) = tanh(x) against x with the vanishing gradient near x = 4 annotated]


As we saw, optimization for deep learning is full of challenges. Fortunately there exists a robust 
range of algorithms that perform well and that are easy to use even for beginners. Furthermore, 
it is not really necessary to find the best solution. Local optima or even approximate solutions 
thereof are still very useful. 


Summary 


Minimizing the training error does not guarantee that we find the best set of parameters to 
minimize the expected error. 


The optimization problems may have many local minima. 
The problem may have even more saddle points, as generally the problems are not convex. 


Vanishing gradients can cause optimization to stall. Often a reparameterization of the prob- 
lem helps. Good initialization of the parameters can be beneficial, too. 


Exercises 


1. Consider a simple multilayer perceptron with a single hidden layer of, say, d dimensions in
   the hidden layer and a single output. Show that for any local minimum there are at least d!
   equivalent solutions that behave identically.
2. Assume that we have a symmetric random matrix $\mathbf{M}$ where the entries $M_{ij} = M_{ji}$ are each
   drawn from some probability distribution $p_{ij}$. Furthermore assume that $p_{ij}(\epsilon) = p_{ij}(-\epsilon)$,
   i.e., that the distribution is symmetric (see e.g., (Wigner, 1958) for details).
   * Prove that the distribution over eigenvalues is also symmetric. That is, for any
     eigenvector $\mathbf{v}$ the associated eigenvalue $\lambda$ satisfies $P(\lambda > 0) = P(\lambda < 0)$.
   * Why does the above not imply $P(\lambda > 0) = 0.5$?
3. What other challenges involved in deep learning optimization can you think of?
4. Assume that you want to balance a (real) ball on a (real) saddle.
   * Why is this hard?
   * Can you exploit this effect also for optimization algorithms?


Discussions


11.2 Convexity 


Convexity plays a vital role in the design of optimization algorithms. This is largely due to the fact
that it is much easier to analyze and test algorithms in this context. In other words, if the algorithm
performs poorly even in the convex setting we should not hope to see great results otherwise.
Furthermore, even though the optimization problems in deep learning are generally nonconvex,
they often exhibit some properties of convex ones near local minima. This can lead to exciting
new optimization variants such as (Izmailov et al., 2018).


11.2.1 Basics 


Let us begin with the basics. 


Sets 


Sets are the basis of convexity. Simply put, a set $\mathcal{X}$ in a vector space is convex if for any
$a, b \in \mathcal{X}$ the line segment connecting $a$ and $b$ is also in $\mathcal{X}$. In mathematical terms
this means that for all $\lambda \in [0, 1]$ we have

$$\lambda \cdot a + (1 - \lambda) \cdot b \in \mathcal{X} \text{ whenever } a, b \in \mathcal{X}. \tag{11.2.1}$$


This sounds a bit abstract. Consider the picture Fig. 11.2.1. The first set is not convex since there 
are line segments that are not contained in it. The other two sets suffer no such problem. 





Fig. 11.2.1: Three shapes, the left one is nonconvex, the others are convex 


Definitions on their own are not particularly useful unless you can do something with them. In
this case we can look at unions and intersections as shown in Fig. 11.2.2. Assume that $\mathcal{X}$ and
$\mathcal{Y}$ are convex sets. Then $\mathcal{X} \cap \mathcal{Y}$ is also convex. To see this, consider any
$a, b \in \mathcal{X} \cap \mathcal{Y}$. Since $\mathcal{X}$ and $\mathcal{Y}$ are convex, the line segment connecting $a$ and
$b$ is contained in both $\mathcal{X}$ and $\mathcal{Y}$. Given that, it also needs to be contained in
$\mathcal{X} \cap \mathcal{Y}$, thus proving our first theorem.





Discussions for Section 11.1: https://discuss.d2l.ai/t/349





Fig. 11.2.2: The intersection between two convex sets is convex 


We can strengthen this result with little effort: given convex sets $\mathcal{X}_i$, their intersection
$\cap_i \mathcal{X}_i$ is convex. To see that the converse is not true, consider two disjoint sets
$\mathcal{X} \cap \mathcal{Y} = \emptyset$. Now pick $a \in \mathcal{X}$ and $b \in \mathcal{Y}$. The line segment in
Fig. 11.2.3 connecting $a$ and $b$ needs to contain some part that is neither in $\mathcal{X}$ nor in
$\mathcal{Y}$, since we assumed that $\mathcal{X} \cap \mathcal{Y} = \emptyset$. Hence the line segment is not in
$\mathcal{X} \cup \mathcal{Y}$ either, thus proving that in general unions of convex sets need not be convex.





Fig. 11.2.3: The union of two convex sets need not be convex 
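

The argument can be replayed numerically on the real line, where convex sets are intervals. In
this sketch the intervals X = [0, 1] and Y = [2, 3], the chosen endpoints, and all helper names are
illustrative assumptions.

```python
def in_X(x):
    """Membership in the convex set X = [0, 1]."""
    return 0.0 <= x <= 1.0


def in_Y(x):
    """Membership in the convex set Y = [2, 3]."""
    return 2.0 <= x <= 3.0


def in_union(x):
    return in_X(x) or in_Y(x)


def segment(a, b, num=11):
    """Points lam * a + (1 - lam) * b for lam on an even grid in [0, 1]."""
    lams = [i / (num - 1) for i in range(num)]
    return [lam * a + (1 - lam) * b for lam in lams]


a, b = 0.5, 2.5  # a lies in X, b lies in Y
outside = [p for p in segment(a, b) if not in_union(p)]
print('segment points outside the union:', [round(p, 2) for p in outside])
print('segment inside X stays in X:', all(in_X(p) for p in segment(0.2, 0.8)))
```

Some points of the segment between a and b fall in the gap between the two intervals, so the
union is not convex, whereas a segment between two points of X never leaves X.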


Typically the problems in deep learning are defined on convex domains. For instance, $\mathbb{R}^d$ is a
convex set (after all, the line between any two points in $\mathbb{R}^d$ remains in $\mathbb{R}^d$). In some cases we work
with variables of bounded length, such as balls of radius $r$ as defined by
$\{\mathbf{x} \mid \mathbf{x} \in \mathbb{R}^d \text{ and } \|\mathbf{x}\|_2 \leq r\}$.


Functions 


Now that we have convex sets we can introduce convex functions $f$. Given a convex set $\mathcal{X}$, a
function $f: \mathcal{X} \to \mathbb{R}$ defined on it is convex if for all $x, x' \in \mathcal{X}$ and for all
$\lambda \in [0, 1]$ we have

$$\lambda f(x) + (1 - \lambda) f(x') \geq f(\lambda x + (1 - \lambda) x'). \tag{11.2.2}$$


To illustrate this let us plot a few functions and check which ones satisfy the requirement. We 
need to import a few libraries. 


%matplotlib inline
from d2l import mxnet as d2l
from mpl_toolkits import mplot3d
from mxnet import np, npx

npx.set_np()


Let us define a few functions, both convex and nonconvex.


f = lambda x: 0.5 * x**2  # Convex
g = lambda x: np.cos(np.pi * x)  # Nonconvex
h = lambda x: np.exp(0.5 * x)  # Convex

x, segment = np.arange(-2, 2, 0.01), np.array([-1.5, 1])
d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
    d2l.plot([x, segment], [func(x), func(segment)], axes=ax)





As expected, the cosine function is nonconvex, whereas the parabola and the exponential function
are convex. Note that the requirement that $\mathcal{X}$ is a convex set is necessary for the condition to
make sense. Otherwise the outcome of $f(\lambda x + (1 - \lambda) x')$ might not be well defined. Convex
functions have a number of desirable properties.
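

The definition (11.2.2) can also be tested directly on a grid. The sketch below is a brute-force
check in plain Python; the helper violates_convexity and the grid resolution are illustrative
assumptions. Finding a violation proves nonconvexity, whereas finding none on a finite grid is only
evidence of convexity, not a proof.

```python
import math


def violates_convexity(func, xs, lambdas, tol=1e-12):
    """True if lam*f(x) + (1-lam)*f(x') < f(lam*x + (1-lam)*x') on the grid."""
    for x in xs:
        for x_prime in xs:
            for lam in lambdas:
                lhs = lam * func(x) + (1 - lam) * func(x_prime)
                rhs = func(lam * x + (1 - lam) * x_prime)
                if lhs < rhs - tol:
                    return True
    return False


f = lambda x: 0.5 * x ** 2           # convex
g = lambda x: math.cos(math.pi * x)  # nonconvex
h = lambda x: math.exp(0.5 * x)      # convex

xs = [-2 + 0.25 * i for i in range(17)]  # grid on [-2, 2]
lambdas = [0.1 * i for i in range(11)]   # grid on [0, 1]
for name, func in [('f', f), ('g', g), ('h', h)]:
    print(name, 'violates (11.2.2):', violates_convexity(func, xs, lambdas))
```

For the cosine, the pair x = -1 and x' = 1 with λ = 0.5 already breaks the inequality, since the
chord lies at -1 while the function value at the midpoint is 1.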


Jensen’s Inequality 


One of the most useful tools is Jensen's inequality. It amounts to a generalization of the definition
of convexity:

$$\sum_i \alpha_i f(x_i) \geq f\left(\sum_i \alpha_i x_i\right) \text{ and } E_x[f(x)] \geq f\left(E_x[x]\right), \tag{11.2.3}$$

where $\alpha_i$ are nonnegative real numbers such that $\sum_i \alpha_i = 1$. In other words, the expectation of a
convex function is no smaller than the convex function of an expectation. To prove the first inequality
we repeatedly apply the definition of convexity to one term in the sum at a time. The expectation
can be proven by taking the limit over finite segments.
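

A quick numerical sanity check of the first inequality follows. It assumes the convex function
f = exp and an arbitrary set of points and weights chosen only for illustration.

```python
import math

xs = [-1.0, 0.0, 2.0, 3.0]
alphas = [0.1, 0.2, 0.3, 0.4]  # nonnegative weights that sum to 1
f = math.exp                   # a convex function

lhs = sum(a * f(x) for a, x in zip(alphas, xs))  # sum_i alpha_i f(x_i)
rhs = f(sum(a * x for a, x in zip(alphas, xs)))  # f(sum_i alpha_i x_i)
print(f'weighted average of f: {lhs:.3f}')
print(f'f of weighted average: {rhs:.3f}')
```

The first number is the larger one, as Jensen's inequality requires for any convex f.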


One of the common applications of Jensen's inequality is with regard to the log-likelihood of
partially observed random variables. That is, we use

$$E_{y \sim P(y)}[-\log P(x \mid y)] \geq -\log P(x). \tag{11.2.4}$$


This follows since $-\log$ is convex and $\int P(y) P(x \mid y) dy = P(x)$. This is used in variational
methods. Here $y$ is typically the unobserved random variable, $P(y)$ is the best guess of how it might
be distributed and $P(x)$ is the distribution with $y$ integrated out. For instance, in clustering $y$
might be the cluster labels and $P(x \mid y)$ is the generative model when applying cluster labels.
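

As a toy instance of (11.2.4), assume two clusters and a single fixed observation x. The
probabilities below are made up purely for illustration.

```python
import math

# Hypothetical values for one fixed observation x and two cluster labels y
p_y = [0.3, 0.7]          # P(y)
p_x_given_y = [0.2, 0.5]  # P(x | y)

# Marginal likelihood P(x), with y summed out
p_x = sum(py * pxy for py, pxy in zip(p_y, p_x_given_y))
# Left-hand side: E_{y ~ P(y)}[-log P(x | y)]
lhs = sum(py * -math.log(pxy) for py, pxy in zip(p_y, p_x_given_y))
# Right-hand side: -log P(x)
rhs = -math.log(p_x)
print(f'E[-log P(x | y)] = {lhs:.3f}, -log P(x) = {rhs:.3f}')
```

The expected negative log-likelihood under P(y) upper-bounds the negative log of the marginal
likelihood, which is what makes it usable as a surrogate objective.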


11.2.2 Properties


Convex functions have a few useful properties. We describe them as follows. 


Local Minima Are Global Minima


In particular, a local minimum of a convex function is also a global minimum. Let us assume the
contrary and prove it wrong. Suppose that $x^* \in \mathcal{X}$ is a local minimum, i.e., there is a small
positive value $p$ such that $f(x^*) < f(x)$ for all $x \in \mathcal{X}$ that satisfy $0 < |x - x^*| \leq p$.
Assume there exists $x' \in \mathcal{X}$ for which $f(x') < f(x^*)$. According to the property of convexity,

$$\begin{aligned}
f(\lambda x^* + (1 - \lambda) x') &\leq \lambda f(x^*) + (1 - \lambda) f(x') \\
&< \lambda f(x^*) + (1 - \lambda) f(x^*) \\
&= f(x^*).
\end{aligned} \tag{11.2.5}$$

There exists $\lambda \in [0, 1)$, for example $\lambda = 1 - \frac{p}{|x^* - x'|}$, so that
$0 < |\lambda x^* + (1 - \lambda) x' - x^*| \leq p$. However, because
$f(\lambda x^* + (1 - \lambda) x') < f(x^*)$, this violates our local minimum statement. Therefore, there
does not exist $x' \in \mathcal{X}$ for which $f(x') < f(x^*)$. The local minimum $x^*$ is also the global minimum.


For instance, the function $f(x) = (x - 1)^2$ has a local minimum at $x = 1$, which is also the global
minimum.


f = lambda x: (x - 1)**2
d2l.set_figsize()
d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)')





The fact that the local minima of convex functions are also the global minima is very convenient.
It means that if we minimize functions we cannot “get stuck”. Note, though, that this does not
mean that there cannot be more than one global minimum or that a global minimum exists at all. For
instance, the function $f(x) = \max(|x| - 1, 0)$ attains its minimum value over the interval $[-1, 1]$.
Conversely, the function $f(x) = \exp(x)$ does not attain a minimum value on $\mathbb{R}$. For $x \to -\infty$ it
asymptotes to $0$, however there is no $x$ for which $f(x) = 0$.


Convex Functions and Sets


Convex functions define convex sets as below-sets. They are defined as

$$\mathcal{S}_b := \{x \mid x \in \mathcal{X} \text{ and } f(x) \leq b\}. \tag{11.2.6}$$

Such sets are convex. Let us prove this quickly. Remember that for any $x, x' \in \mathcal{S}_b$ we need to
show that $\lambda x + (1 - \lambda) x' \in \mathcal{S}_b$ as long as $\lambda \in [0, 1]$. But this follows directly from the
definition of convexity since $f(\lambda x + (1 - \lambda) x') \leq \lambda f(x) + (1 - \lambda) f(x') \leq b$.


Have a look at the function $f(x, y) = 0.5 x^2 + \cos(2 \pi y)$ below. It is clearly nonconvex. The level
sets are correspondingly nonconvex. In fact, they are typically composed of disjoint sets.


x, y = np.meshgrid(
    np.linspace(-1.0, 1.0, 101), np.linspace(-1.0, 1.0, 101))
z = x**2 + 0.5 * np.cos(2 * np.pi * y)
# Plot the 3D surface
d2l.set_figsize((6, 4))
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.contour(x, y, z, offset=-1)
ax.set_zlim(-1, 1.5)
# Adjust labels
for func in [d2l.plt.xticks, d2l.plt.yticks, ax.set_zticks]:
    func([-1, 0, 1])
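

A single pair of points is enough to certify that a below-set of this function is nonconvex. The
sketch uses the formula as written in the text, f(x, y) = 0.5 x^2 + cos(2πy), and the threshold
b = 0, an arbitrary illustrative choice.

```python
import math


def f(x, y):
    """The nonconvex function as written in the text."""
    return 0.5 * x ** 2 + math.cos(2 * math.pi * y)


b = 0.0
p, q = (0.0, 0.5), (0.0, -0.5)                # two points of the below-set
mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)  # midpoint of the segment

print('p in below-set:  ', f(*p) <= b)
print('q in below-set:  ', f(*q) <= b)
print('mid in below-set:', f(*mid) <= b)
```

Both endpoints satisfy f ≤ b while their midpoint does not, so the below-set contains two points
without containing the segment between them.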
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Derivatives and Convexity 


Whenever the second derivative of a function exists it is very easy to check for convexity. All we
need to do is check whether $\partial_x^2 f(x) \succeq 0$, i.e., whether all of its eigenvalues are nonnegative. For
instance, the function $f(\mathbf{x}) = \frac{1}{2} \|\mathbf{x}\|^2_2$ is convex since $\partial_{\mathbf{x}}^2 f = \mathbf{1}$, i.e., its second derivative
(the Hessian) is the identity matrix.


The first thing to realize is that we only need to prove this property for one-dimensional functions.
After all, in general we can always define some function $g(z) = f(\mathbf{x} + z \cdot \mathbf{v})$. This function has the
first and second derivatives $g' = (\partial_{\mathbf{x}} f)^\top \mathbf{v}$ and $g'' = \mathbf{v}^\top (\partial^2_{\mathbf{x}} f) \mathbf{v}$ respectively. In particular, $g'' \geq 0$ for
all $\mathbf{v}$ whenever the Hessian of $f$ is positive semidefinite, i.e., whenever all of its eigenvalues are
greater than or equal to zero. Hence back to the scalar case.
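
For a two-dimensional problem the eigenvalue test can be written out by hand. The sketch below uses the closed-form eigenvalues of a symmetric $2 \times 2$ matrix; the helper names and the saddle example $x^2 - y^2$ are our own assumptions, chosen to contrast with the convex case from the text.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]] (closed form)."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius

def is_psd(a, b, d, tol=1e-12):
    """True if the smallest eigenvalue is nonnegative up to tol."""
    return eig_sym_2x2(a, b, d)[0] >= -tol

def curvature_along(a, b, d, v):
    """g''(z) = v^T H v for H = [[a, b], [b, d]] and direction v."""
    return a * v[0] ** 2 + 2 * b * v[0] * v[1] + d * v[1] ** 2

print(is_psd(1, 0, 1))    # Hessian of 0.5 * ||x||^2: the identity
print(is_psd(2, 0, -2))   # Hessian of x^2 - y^2: a saddle
print(curvature_along(2, 0, -2, (0, 1)))  # negative curvature along y
```

The last line shows the reduction to one dimension at work: a single direction $\mathbf{v}$ with $g'' < 0$ is enough to rule out convexity.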


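To see that $f''(x) \geq 0$ for convex functions we use the fact that

$$\frac{1}{2} f(x + \epsilon) + \frac{1}{2} f(x - \epsilon) \geq f\left(\frac{x + \epsilon}{2} + \frac{x - \epsilon}{2}\right) = f(x). \tag{11.2.7}$$

Since the second derivative is given by the limit over finite differences it follows that

$$f''(x) = \lim_{\epsilon \to 0} \frac{f(x+\epsilon) + f(x - \epsilon) - 2f(x)}{\epsilon^2} \geq 0. \tag{11.2.8}$$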
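
The finite-difference quotient in (11.2.8) is easy to evaluate numerically. The sketch below is illustrative only: the step size `eps` and the two test functions are our own choices.

```python
import math

def second_difference(f, x, eps=1e-3):
    """Central finite-difference estimate of f''(x), as in (11.2.8)."""
    return (f(x + eps) + f(x - eps) - 2 * f(x)) / eps ** 2

convex = lambda x: math.cosh(x)     # f'' = cosh(x) > 0 everywhere
nonconvex = lambda x: math.cos(x)   # f'' = -cos(x), negative near 0

print(second_difference(convex, 0.0))     # close to  1
print(second_difference(nonconvex, 0.0))  # close to -1
```

For a convex function the numerator is nonnegative for every `eps`, not only in the limit, which is exactly the statement of (11.2.7).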





To see that the converse is true we use the fact that $f'' \geq 0$ implies that $f'$ is a monotonically
increasing function. Let $a < x < b$ be three points in $\mathbb{R}$. We use the mean value theorem to
express

$$\begin{aligned}
f(x) - f(a) & = (x-a) f'(\alpha) \text{ for some } \alpha \in [a, x] \text{ and } \\
f(b) - f(x) & = (b-x) f'(\beta) \text{ for some } \beta \in [x, b].
\end{aligned} \tag{11.2.9}$$

By monotonicity $f'(\beta) \geq f'(\alpha)$, hence

$$\begin{aligned}
f(b) - f(a) & = f(b) - f(x) + f(x) - f(a) \\
& = (b-x) f'(\beta) + (x-a) f'(\alpha) \\
& \geq (b-a) f'(\alpha).
\end{aligned} \tag{11.2.10}$$

By geometry it follows that $f(x)$ is below the line connecting $f(a)$ and $f(b)$, thus proving convexity.
We omit a more formal derivation in favor of a graph below.


f = lambda x: 0.5 * x**2
x = np.arange(-2, 2, 0.01)
axb, ab = np.array([-1.5, -0.5, 1]), np.array([-1.5, 1])
d2l.set_figsize()
d2l.plot([x, axb, ab], [f(x) for x in [x, axb, ab]], 'x', 'f(x)')
d2l.annotate('a', (-1.5, f(-1.5)), (-1.5, 1.5))
d2l.annotate('b', (1, f(1)), (1, 1.5))
d2l.annotate('x', (-0.5, f(-0.5)), (-1.5, f(-0.5)))
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11.2.3 Constraints 


One of the nice properties of convex optimization is that it allows us to handle constraints
efficiently. That is, it allows us to solve problems of the form:

$$\begin{aligned}
\mathop{\mathrm{minimize}}_{\mathbf{x}} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & c_i(\mathbf{x}) \leq 0 \text{ for all } i \in \{1, \ldots, N\}.
\end{aligned} \tag{11.2.11}$$


Here $f$ is the objective and the functions $c_i$ are constraint functions. To see what this does consider
the case where $c_1(\mathbf{x}) = \|\mathbf{x}\|_2 - 1$. In this case the parameters $\mathbf{x}$ are constrained to the unit ball. If a
second constraint is $c_2(\mathbf{x}) = \mathbf{v}^\top \mathbf{x} + b$, then this corresponds to all $\mathbf{x}$ lying on a halfspace. Satisfying
both constraints simultaneously amounts to selecting a slice of a ball as the constraint set.
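
To make the constraint set concrete, the sketch below encodes the two constraints from the text as functions and tests a few points for feasibility. The particular values of $\mathbf{v}$ and $b$ and the helper names are hypothetical, picked only for illustration.

```python
import math

def c1(x):
    """Unit-ball constraint: ||x||_2 - 1 <= 0."""
    return math.sqrt(sum(xi ** 2 for xi in x)) - 1

def make_halfspace(v, b):
    """Halfspace constraint: v^T x + b <= 0."""
    return lambda x: sum(vi * xi for vi, xi in zip(v, x)) + b

def feasible(x, constraints):
    """A point is feasible if every constraint function is nonpositive."""
    return all(c(x) <= 0 for c in constraints)

c2 = make_halfspace(v=(1.0, 0.0), b=-0.5)   # keeps points with x_1 <= 0.5
print(feasible((0.2, 0.3), [c1, c2]))   # inside the ball and the halfspace
print(feasible((0.8, 0.0), [c1, c2]))   # inside the ball, outside the halfspace
print(feasible((-2.0, 0.0), [c1, c2]))  # inside the halfspace, outside the ball
```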


Lagrange Function 


In general, solving a constrained optimization problem is difficult. One way of addressing it stems 
from physics with a rather simple intuition. Imagine a ball inside a box. The ball will roll to the 
place that is lowest and the forces of gravity will be balanced out with the forces that the sides of 
the box can impose on the ball. In short, the gradient of the objective function (i.e., gravity) will 
be offset by the gradient of the constraint function (need to remain inside the box by virtue of the 
walls “pushing back”). Note that any constraint that is not active (i.e., the ball does not touch the 
wall) will not be able to exert any force on the ball. 


Skipping over the derivation of the Lagrange function $L$ (for details see, e.g., Boyd & Vandenberghe,
2004), the above reasoning can be expressed via the following saddlepoint optimization problem:

$$L(\mathbf{x}, \alpha) = f(\mathbf{x}) + \sum_i \alpha_i c_i(\mathbf{x}) \text{ where } \alpha_i \geq 0. \tag{11.2.12}$$


Here the variables $\alpha_i$ are the so-called Lagrange multipliers that ensure that a constraint is properly
enforced. They are chosen just large enough to ensure that $c_i(\mathbf{x}) \leq 0$ for all $i$. For instance, for
any $\mathbf{x}$ for which $c_i(\mathbf{x}) < 0$ naturally, we would end up picking $\alpha_i = 0$. Moreover, this is a saddlepoint
optimization problem where one wants to maximize $L$ with respect to $\alpha$ and simultaneously
minimize it with respect to $\mathbf{x}$. There is a rich body of literature explaining how to arrive at the function
$L(\mathbf{x}, \alpha)$. For our purposes it is sufficient to know that the saddlepoint of $L$ is where the original
constrained optimization problem is solved optimally.
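
A one-dimensional worked example may help to see the saddlepoint at work. The problem below is our own toy choice, not taken from the text: minimize $f(x) = x^2$ subject to $c(x) = 1 - x \leq 0$.

```latex
\begin{aligned}
L(x, \alpha) &= x^2 + \alpha (1 - x), \qquad \alpha \geq 0, \\
\partial_x L = 2x - \alpha = 0 \;&\Longrightarrow\; x = \alpha / 2, \\
\min_x L(x, \alpha) &= \frac{\alpha^2}{4} + \alpha - \frac{\alpha^2}{2} = \alpha - \frac{\alpha^2}{4}, \\
\partial_\alpha \left(\alpha - \frac{\alpha^2}{4}\right) = 1 - \frac{\alpha}{2} = 0 \;&\Longrightarrow\; \alpha = 2, \quad x = 1.
\end{aligned}
```

The constraint is active ($c(1) = 0$) and the multiplier is strictly positive. If the constraint is changed to $-1 - x \leq 0$, the unconstrained minimizer $x = 0$ is already feasible, the same calculation gives $\min_x L = -\alpha - \alpha^2/4$, and the maximum over $\alpha \geq 0$ is attained at $\alpha = 0$: an inactive constraint exerts no force.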
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Penalties 


One way of satisfying constrained optimization problems at least approximately is to adapt the
Lagrange function $L$. Rather than satisfying $c_i(\mathbf{x}) \leq 0$ we simply add $\alpha_i c_i(\mathbf{x})$ to the objective function
$f(\mathbf{x})$. This ensures that the constraints will not be violated too badly.


In fact, we have been using this trick all along. Consider weight decay in Section 4.5. In it we add
$\frac{\lambda}{2} \|\mathbf{w}\|^2$ to the objective function to ensure that $\mathbf{w}$ does not grow too large. Using the constrained
optimization point of view we can see that this will ensure that $\|\mathbf{w}\|^2 - r^2 \leq 0$ for some radius $r$.
Adjusting the value of $\lambda$ allows us to vary the size of $\mathbf{w}$.


In general, adding penalties is a good way of ensuring approximate constraint satisfaction. In
practice this turns out to be much more robust than exact satisfaction. Furthermore, for nonconvex
problems many of the properties that make the exact approach so appealing in the convex
case (e.g., optimality) no longer hold.
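
The link between the penalty weight and the size of the solution can be seen in a scalar toy problem. The sketch below assumes the objective $\frac{1}{2}(w - t)^2 + \frac{\lambda}{2} w^2$ with a made-up target $t = 3$; it is an illustration, not the book's weight decay implementation.

```python
def penalized_minimizer(target, lam):
    """Minimizer of 0.5 * (w - target)^2 + 0.5 * lam * w^2, obtained by
    setting the derivative (w - target) + lam * w to zero."""
    return target / (1 + lam)

for lam in [0.0, 0.5, 2.0, 10.0]:
    w = penalized_minimizer(3.0, lam)
    print(f'lambda={lam:5.1f}  w={w:.3f}  w^2={w * w:.3f}')
```

Each value of $\lambda$ corresponds to some radius $r = |w|$ of an equivalent constraint $w^2 - r^2 \leq 0$, and a larger $\lambda$ yields a smaller radius.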


Projections 


An alternative strategy for satisfying constraints is to use projections. Again, we encountered them
before, e.g., when dealing with gradient clipping in Section 8.5. There we ensured that a gradient
has length bounded by $c$ via

$$\mathbf{g} \leftarrow \mathbf{g} \cdot \mathrm{min}(1, c/\|\mathbf{g}\|). \tag{11.2.13}$$


This turns out to be a projection of $\mathbf{g}$ onto the ball of radius $c$. More generally, a projection on a
(convex) set $X$ is defined as

$$\mathrm{Proj}_X(\mathbf{x}) = \mathop{\mathrm{argmin}}_{\mathbf{x}' \in X} \|\mathbf{x} - \mathbf{x}'\|. \tag{11.2.14}$$

It is thus the closest point in $X$ to $\mathbf{x}$. This sounds a bit abstract. Fig. 11.2.4 explains it somewhat
more clearly. In it we have two convex sets, a circle and a diamond. Points inside the set (yellow)
remain unchanged. Points outside the set (black) are mapped to the closest point inside the set
(red). While for $\ell_2$ balls this leaves the direction unchanged, this need not be the case in general,
as can be seen in the case of the diamond.


Fig. 11.2.4: Convex Projections 


One of the uses for convex projections is to compute sparse weight vectors. In this case we project
$\mathbf{w}$ onto an $\ell_1$ ball (the latter is a generalized version of the diamond in the picture above).
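
Two projections that have simple closed forms are sketched below in plain Python. The $\ell_2$ ball follows (11.2.13); the box is our own additional example of a convex set where the direction can change, and both helper names are hypothetical.

```python
import math

def project_l2_ball(x, c=1.0):
    """Closest point to x in {z : ||z||_2 <= c}; same rule as gradient clipping."""
    norm = math.sqrt(sum(xi ** 2 for xi in x))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [xi * scale for xi in x]

def project_box(x, lo=-1.0, hi=1.0):
    """Closest point to x in the box [lo, hi]^d: clip each coordinate."""
    return [min(max(xi, lo), hi) for xi in x]

print(project_l2_ball([0.3, 0.4]))   # inside the ball: unchanged
print(project_l2_ball([3.0, 4.0]))   # rescaled to norm 1, same direction
print(project_box([3.0, 0.5]))       # clipped coordinate-wise, direction changes
```

The projection onto the $\ell_1$ ball has no such one-line formula and is left to the exercises.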
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Summary 


In the context of deep learning the main purpose of convex functions is to motivate optimization
algorithms and help us understand them in detail. In the following we will see how gradient
descent and stochastic gradient descent can be derived accordingly.

• Intersections of convex sets are convex. Unions are not necessarily convex.
• The expectation of a convex function is no smaller than the convex function of an expectation
  (Jensen's inequality).
• A twice-differentiable function is convex if and only if its second derivative has only
  nonnegative eigenvalues throughout.
• Convex constraints can be added via the Lagrange function. In practice simply add them
  with a penalty to the objective function.
• Projections map to points in the (convex) set closest to the original point.


Exercises 


1. Assume that we want to verify convexity of a set by drawing all lines between points within
   the set and checking whether the lines are contained.
    • Prove that it is sufficient to check only the points on the boundary.
    • Prove that it is sufficient to check only the vertices of the set.
2. Denote by $B_p[r] := \{\mathbf{x} \mid \mathbf{x} \in \mathbb{R}^d \text{ and } \|\mathbf{x}\|_p \leq r\}$ the ball of radius $r$ using the $p$-norm. Prove that
   $B_p[r]$ is convex for all $p \geq 1$.
3. Given convex functions $f$ and $g$ show that $\mathrm{max}(f, g)$ is convex, too. Prove that $\mathrm{min}(f, g)$ is
   not necessarily convex.
4. Prove that the normalization of the softmax function is convex. More specifically prove the
   convexity of $f(x) = \log \sum_i \exp(x_i)$.
5. Prove that linear subspaces are convex sets, i.e., $X = \{\mathbf{x} \mid \mathbf{W} \mathbf{x} = \mathbf{b}\}$.
6. Prove that in the case of linear subspaces with $\mathbf{b} = 0$ the projection $\mathrm{Proj}_X$ can be written as
   $\mathbf{M} \mathbf{x}$ for some matrix $\mathbf{M}$.
7. Show that for convex twice differentiable functions $f$ we can write $f(x + \epsilon) = f(x) + \epsilon f'(x) +
   \frac{1}{2} \epsilon^2 f''(x + \xi)$ for some $\xi \in [0, \epsilon]$.
8. Given a vector $\mathbf{w} \in \mathbb{R}^d$ with $\|\mathbf{w}\|_1 > 1$ compute the projection on the $\ell_1$ unit ball.
    • As intermediate step write out the penalized objective $\|\mathbf{w} - \mathbf{w}'\|_2^2 + \lambda \|\mathbf{w}'\|_1$ and compute
      the solution for a given $\lambda > 0$.
    • Can you find the ‘right’ value of $\lambda$ without a lot of trial and error?
9. Given a convex set $X$ and two vectors $\mathbf{x}$ and $\mathbf{y}$ prove that projections never increase distances,
   i.e., $\|\mathbf{x} - \mathbf{y}\| \geq \|\mathrm{Proj}_X(\mathbf{x}) - \mathrm{Proj}_X(\mathbf{y})\|$.

Discussions (footnote 130)

130 https://discuss.d2l.ai/t/350
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11.3 Gradient Descent 


In this section we are going to introduce the basic concepts underlying gradient descent. This is
brief by necessity. See e.g., (Boyd & Vandenberghe, 2004) for an in-depth introduction to convex
optimization. Although gradient descent is rarely used directly in deep learning, an understanding of
it is key to understanding stochastic gradient descent algorithms. For instance,
the optimization problem might diverge due to an overly large learning rate. This phenomenon
can already be seen in gradient descent. Likewise, preconditioning is a common technique in
gradient descent and carries over to more advanced algorithms. Let us start with a simple special
case.


11.3.1 Gradient Descent in One Dimension 


Gradient descent in one dimension is an excellent example to explain why the gradient descent
algorithm may reduce the value of the objective function. Consider some continuously
differentiable real-valued function $f: \mathbb{R} \rightarrow \mathbb{R}$. Using a Taylor expansion (Section 18.3) we obtain that

$$f(x + \epsilon) = f(x) + \epsilon f'(x) + O(\epsilon^2). \tag{11.3.1}$$


That is, in first approximation $f(x+\epsilon)$ is given by the function value $f(x)$ and the first derivative
$f'(x)$ at $x$. It is not unreasonable to assume that for small $\epsilon$ moving in the direction of the negative
gradient will decrease $f$. To keep things simple we pick a fixed step size $\eta > 0$ and choose $\epsilon =
-\eta f'(x)$. Plugging this into the Taylor expansion above we get

$$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + O(\eta^2 f'^2(x)). \tag{11.3.2}$$


If the derivative $f'(x) \neq 0$ does not vanish we make progress since $\eta f'^2(x) > 0$. Moreover, we can
always choose $\eta$ small enough for the higher order terms to become irrelevant. Hence we arrive
at

$$f(x - \eta f'(x)) \lessapprox f(x). \tag{11.3.3}$$

This means that, if we use

$$x \leftarrow x - \eta f'(x) \tag{11.3.4}$$

to iterate $x$, the value of function $f(x)$ might decline. Therefore, in gradient descent we first choose
an initial value $x$ and a constant $\eta > 0$ and then use them to continuously iterate $x$ until the stop
condition is reached, for example, when the magnitude of the gradient $|f'(x)|$ is small enough or
the number of iterations has reached a certain value.
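
The two stopping conditions just mentioned can be sketched as follows. This is a minimal illustration in plain Python; the function name `gradient_descent`, the tolerance `tol`, and the iteration cap `max_iter` are our own assumptions, and the derivative $2x$ of $f(x) = x^2$ serves as a stand-in objective.

```python
def gradient_descent(gradf, x0, eta, tol=1e-6, max_iter=1000):
    """Iterate x <- x - eta * f'(x) until |f'(x)| <= tol or max_iter is hit.

    Returns the final iterate and the number of update steps performed."""
    x = x0
    for step in range(max_iter):
        g = gradf(x)
        if abs(g) <= tol:        # gradient small enough: stop
            return x, step
        x -= eta * g
    return x, max_iter           # iteration budget exhausted

x, steps = gradient_descent(lambda x: 2 * x, x0=10.0, eta=0.2)
print(x, steps)
```

Returning the step count alongside the iterate makes it easy to tell which of the two conditions ended the loop.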


For simplicity we choose the objective function $f(x) = x^2$ to illustrate how to implement gradient
descent. Although we know that $x = 0$ is the solution to minimize $f(x)$, we still use this simple
function to observe how $x$ changes. As always, we begin by importing all required modules.


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()

f = lambda x: x**2  # Objective function
gradf = lambda x: 2 * x  # Its derivative


Next, we use $x = 10$ as the initial value and assume $\eta = 0.2$. Using gradient descent to iterate $x$ for
10 times we can see that, eventually, the value of $x$ approaches the optimal solution.


def gd(eta):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * gradf(x)
        results.append(float(x))
    print('epoch 10, x:', x)
    return results


res = gd(0.2)


epoch 10, x: 0.06046617599999997 


The progress of optimizing over $x$ can be plotted as follows.

def show_trace(res):
    n = max(abs(min(res)), abs(max(res)))
    f_line = np.arange(-n, n, 0.01)
    d2l.set_figsize()
    d2l.plot([f_line, res], [[f(x) for x in f_line], [f(x) for x in res]],
             'x', 'f(x)', fmts=['-', '-o'])


show_trace(res)
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Learning Rate 


The learning rate $\eta$ can be set by the algorithm designer. If we use a learning rate that is too small,
it will cause $x$ to update very slowly, requiring more iterations to get a better solution. To show
what happens in such a case, consider the progress in the same optimization problem for $\eta = 0.05$.
As we can see, even after 10 steps we are still very far from the optimal solution.


show_trace(gd(0.05)) 


epoch 10, x: 3.4867844009999995

Conversely, if we use an excessively high learning rate, $|\eta f'(x)|$ might be too large for the first-order
Taylor expansion formula. That is, the term $O(\eta^2 f'^2(x))$ in (11.3.2) might become significant. In
this case, we cannot guarantee that the iteration of $x$ will be able to lower the value of $f(x)$. For
example, when we set the learning rate to $\eta = 1.1$, $x$ overshoots the optimal solution $x = 0$ and
gradually diverges.


show_trace(gd(1.1)) 


epoch 10, x: 61.917364224000096 
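
For the quadratic $f(x) = x^2$ the three printed results can be reproduced in closed form, since each update multiplies $x$ by the constant factor $1 - 2\eta$. The sketch below is our own illustration; the helper name `x_after` is hypothetical.

```python
def x_after(k, eta, x0=10.0):
    """Closed form of k gradient descent steps on f(x) = x^2:
    x <- x - eta * 2x = (1 - 2 * eta) * x."""
    return (1 - 2 * eta) ** k * x0

for eta in [0.05, 0.2, 1.1]:
    factor = 1 - 2 * eta
    verdict = 'converges' if abs(factor) < 1 else 'diverges'
    print(f'eta={eta}: factor={factor:+.2f}, x_10={x_after(10, eta):.6f} ({verdict})')
```

The iteration converges exactly when $|1 - 2\eta| < 1$, i.e., for $0 < \eta < 1$. For $\eta = 1.1$ the factor is $-1.2$, so the sign flips and the magnitude grows by 20% in every step.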
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Local Minima 


To illustrate what happens for nonconvex functions consider the case of $f(x) = x \cdot \cos(cx)$ for some
constant $c$. This function has infinitely many local minima. Depending on our choice of learning rate and
depending on how well conditioned the problem is, we may end up with one of many solutions.
The example below illustrates how an (unrealistically) high learning rate will lead to a poor local
minimum.


c = np.array(0.15 * np.pi) 

f = lambda x: x * np.cos(c * x) 

gradf = lambda x: np.cos(c * x) - c * x * np.sin(c * x) 
show_trace(gd(2)) 


epoch 10, x: -1.5281651 
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11.3.2 Multivariate Gradient Descent 


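Now that we have a better intuition of the univariate case, let us consider the situation where
$\mathbf{x} \in \mathbb{R}^d$. That is, the objective function $f: \mathbb{R}^d \to \mathbb{R}$ maps vectors into scalars. Correspondingly its
gradient is multivariate, too. It is a vector consisting of $d$ partial derivatives:

$$\nabla f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}\bigg]^\top. \tag{11.3.5}$$


Each partial derivative element $\partial f(\mathbf{x})/\partial x_i$ in the gradient indicates the rate of change of $f$ at $\mathbf{x}$
with respect to the input $x_i$. As before in the univariate case we can use the corresponding Taylor
approximation for multivariate functions to get some idea of what we should do. In particular, we
have that

$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + O(\|\boldsymbol{\epsilon}\|^2). \tag{11.3.6}$$


In other words, up to second order terms in $\boldsymbol{\epsilon}$ the direction of steepest descent is given by the
negative gradient $-\nabla f(\mathbf{x})$. Choosing a suitable learning rate $\eta > 0$ yields the prototypical gradient
descent algorithm:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x}). \tag{11.3.7}$$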
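
The claim that the negative gradient is the direction of steepest descent can be checked by brute force in two dimensions. The sketch below takes $f(\mathbf{x}) = x_1^2 + 2 x_2^2$ at the point $(-5, -2)$ as an example and scans unit directions on a grid; the grid resolution and the helper name are our own choices.

```python
import math

def directional_derivative(grad, angle):
    """eps^T grad for the unit vector eps = (cos(angle), sin(angle))."""
    return math.cos(angle) * grad[0] + math.sin(angle) * grad[1]

grad = (2 * -5.0, 4 * -2.0)   # gradient of x1^2 + 2 * x2^2 at (-5, -2)
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best = min(angles, key=lambda a: directional_derivative(grad, a))

norm = math.hypot(*grad)
print((math.cos(best), math.sin(best)))      # steepest direction found on the grid
print((-grad[0] / norm, -grad[1] / norm))    # normalized negative gradient
```

Up to the grid resolution of 0.1 degrees the two printed directions agree.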







To see how the algorithm behaves in practice let us construct an objective function $f(\mathbf{x}) = x_1^2 + 2 x_2^2$
with a two-dimensional vector $\mathbf{x} = [x_1, x_2]^\top$ as input and a scalar as output. The gradient is given
by $\nabla f(\mathbf{x}) = [2x_1, 4x_2]^\top$. We will observe the trajectory of $\mathbf{x}$ by gradient descent from the initial
position $[-5, -2]$. We need two more helper functions. The first uses an update function and
applies it 20 times to the initial value. The second helper visualizes the trajectory of $\mathbf{x}$.


def train_2d(trainer, steps=20):  #@save
    """Optimize a 2-dim objective function with a customized trainer."""
    # s1 and s2 are internal state variables and will
    # be used later in the chapter
    x1, x2, s1, s2 = -5, -2, 0, 0
    results = [(x1, x2)]
    for i in range(steps):
        x1, x2, s1, s2 = trainer(x1, x2, s1, s2)
        results.append((x1, x2))
    return results


def show_trace_2d(f, results):  #@save
    """Show the trace of 2D variables during optimization."""
    d2l.set_figsize()
    d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
    x1, x2 = np.meshgrid(np.arange(-5.5, 1.0, 0.1),
                         np.arange(-3.0, 1.0, 0.1))
    d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
    d2l.plt.xlabel('x1')
    d2l.plt.ylabel('x2')


Next, we observe the trajectory of the optimization variable $\mathbf{x}$ for learning rate $\eta = 0.1$. We can see
that after 20 steps the value of $\mathbf{x}$ approaches its minimum at $[0, 0]$. Progress is fairly well-behaved
albeit rather slow.


f = lambda x1, x2: x1 ** 2 + 2 * x2 ** 2  # Objective
gradf = lambda x1, x2: (2 * x1, 4 * x2)  # Gradient


def gd(x1, x2, s1, s2):
    (g1, g2) = gradf(x1, x2)  # Compute gradient
    return (x1 - eta * g1, x2 - eta * g2, 0, 0)  # Update variables


eta = 0.1
show_trace_2d(f, train_2d(gd))
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11.3.3 Adaptive Methods 


As we could see in Section 11.3.1, getting the learning rate $\eta$ “just right” is tricky. If we pick it too
small, we make little progress. If we pick it too large, the solution oscillates and in the worst case it
might even diverge. What if we could determine $\eta$ automatically or get rid of having to select a step
size at all? Second order methods that look not only at the value and gradient of the objective but
also at its curvature can help in this case. While these methods cannot be applied to deep learning
directly due to the computational cost, they provide useful intuition into how to design advanced
optimization algorithms that mimic many of the desirable properties of the algorithms outlined
below.


Newton's Method 


Reviewing the Taylor expansion of $f$ there is no need to stop after the first term. In fact, we can
write it as

$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \nabla \nabla^\top f(\mathbf{x}) \boldsymbol{\epsilon} + O(\|\boldsymbol{\epsilon}\|^3). \tag{11.3.8}$$


To avoid cumbersome notation we define $H_f := \nabla \nabla^\top f(\mathbf{x})$ to be the Hessian of $f$. This is a $d \times d$
matrix. For small $d$ and simple problems $H_f$ is easy to compute. For deep networks, on the other
hand, $H_f$ may be prohibitively large, due to the cost of storing $O(d^2)$ entries. Furthermore it may
be too expensive to compute via backpropagation as we would need to apply backpropagation
to the backpropagation call graph. For now let us ignore such considerations and look at what
algorithm we would get.


After all, the minimum of $f$ satisfies $\nabla f(\mathbf{x}) = 0$. Taking derivatives of (11.3.8) with regard to $\boldsymbol{\epsilon}$ and
ignoring higher order terms we arrive at

$$\nabla f(\mathbf{x}) + H_f \boldsymbol{\epsilon} = 0 \text{ and hence } \boldsymbol{\epsilon} = -H_f^{-1} \nabla f(\mathbf{x}). \tag{11.3.9}$$


That is, we need to invert the Hessian $H_f$ as part of the optimization problem.


For $f(x) = \frac{1}{2} x^2$ we have $\nabla f(x) = x$ and $H_f = 1$. Hence for any $x$ we obtain $\epsilon = -x$. In other
words, a single step is sufficient to converge perfectly without the need for any adjustment! Alas,
we got a bit lucky here since the Taylor expansion was exact. Let us see what happens in other
problems.
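
The same one-step behavior holds for any quadratic in several dimensions, because the second-order Taylor expansion is exact there. The sketch below applies (11.3.9) to the earlier example $f(\mathbf{x}) = x_1^2 + 2 x_2^2$; solving the $2 \times 2$ system by Cramer's rule and the helper name `newton_step_2d` are our own choices for a dependency-free illustration.

```python
def newton_step_2d(x, grad, hess):
    """One Newton step x + eps with H eps = -grad, solved by Cramer's rule
    for a 2 x 2 Hessian given as ((a, b), (c, d))."""
    (a, b), (c, d) = hess
    det = a * d - b * c
    eps0 = (-grad[0] * d + grad[1] * b) / det
    eps1 = (-grad[1] * a + grad[0] * c) / det
    return (x[0] + eps0, x[1] + eps1)

x = (-5.0, -2.0)
grad = (2 * x[0], 4 * x[1])          # gradient of x1^2 + 2 * x2^2
hess = ((2.0, 0.0), (0.0, 4.0))      # constant Hessian of the quadratic
print(newton_step_2d(x, grad, hess))
```

Gradient descent with $\eta = 0.1$ needed 20 steps to get close to the origin on this function, while the Newton step lands on it directly.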


c = np.array(0.5)
f = lambda x: np.cosh(c * x)  # Objective
gradf = lambda x: c * np.sinh(c * x)  # Derivative
hessf = lambda x: c**2 * np.cosh(c * x)  # Hessian


def newton(eta=1):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * gradf(x) / hessf(x)
        results.append(float(x))
    print('epoch 10, x:', x)
    return results


show_trace(newton())
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epoch 10, x: 0.0 
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Now let us see what happens when we have a nonconvex function, such as f(x) = xcos(cx). After 
all, note that in Newton’s method we end up dividing by the Hessian. This means that if the second 
derivative is negative we would walk into the direction of increasing f. That is a fatal flaw of the 
algorithm. Let us see what happens in practice. 

c = np.array(0.15 * np.pi) 

f = lambda x: x * np.cos(c * x) 

gradf = lambda x: np.cos(c * x) - c * x * np.sin(c * x) 


hessf = lambda x: - 2 * c * np.sin(c * x) - x * c**2 * np.cos(c * x)


show_trace(newton()) 


epoch 10, x: 26.834133 
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This went spectacularly wrong. How can we fix it? One way would be to “fix” the Hessian by taking 
its absolute value instead. Another strategy is to bring back the learning rate. This seems to defeat 
the purpose, but not quite. Having second order information allows us to be cautious whenever 
the curvature is large and to take longer steps whenever the objective is flat. Let us see how this
works with a slightly smaller learning rate, say $\eta = 0.5$. As we can see, we have quite an efficient
algorithm.


show_trace(newton(0.5)) 


epoch 10, x: 7.26986 
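
The other fix mentioned above, taking the absolute value of the Hessian, can be sketched in the same style. The code below uses plain Python instead of the MXNet arrays of the surrounding listings, and the function name `newton_abs` and the small constant `eps` guarding against division by zero are our own assumptions.

```python
import math

c = 0.15 * math.pi
f = lambda x: x * math.cos(c * x)
gradf = lambda x: math.cos(c * x) - c * x * math.sin(c * x)
hessf = lambda x: -2 * c * math.sin(c * x) - x * c ** 2 * math.cos(c * x)

def newton_abs(x=10.0, steps=10, eta=1.0, eps=1e-8):
    """Newton-like iteration that divides by |f''(x)| (plus a small eps to
    avoid division by zero), so every step points downhill."""
    trace = [x]
    for _ in range(steps):
        x -= eta * gradf(x) / (abs(hessf(x)) + eps)
        trace.append(x)
    return trace

trace = newton_abs()
print(trace[-1], f(trace[-1]))
```

Dividing by $|f''|$ only guarantees that each step moves against the gradient. It does not guarantee that $f$ decreases, since the step can still be too long where the curvature is close to zero, so combining it with a learning rate $\eta < 1$ remains a reasonable precaution.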
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Convergence Analysis 


We only analyze the convergence rate for convex and three times differentiable $f$, where at its
minimum $x^*$ the second derivative is nonzero, i.e., where $f''(x^*) > 0$. The multivariate proof is
a straightforward extension of the argument below and omitted since it does not help us much in
terms of intuition.


Denote by $x_k$ the value of $x$ at the $k$-th iteration and let $e_k := x_k - x^*$ be the distance from optimality.
By Taylor series expansion we have that the condition $f'(x^*) = 0$ can be written as

$$0 = f'(x_k - e_k) = f'(x_k) - e_k f''(x_k) + \frac{1}{2} e_k^2 f'''(\xi_k). \tag{11.3.10}$$


This holds for some $\xi_k \in [x_k - e_k, x_k]$. Recall that we have the update $x_{k+1} = x_k - f'(x_k) / f''(x_k)$.
Dividing the above expansion by $f''(x_k)$ yields

$$e_k - f'(x_k) / f''(x_k) = \frac{1}{2} e_k^2 f'''(\xi_k) / f''(x_k). \tag{11.3.11}$$


By the update equation the left-hand side equals $e_{k+1}$, which leads to the bound
$|e_{k+1}| \leq e_k^2 |f'''(\xi_k)| / f''(x_k)$. Consequently, whenever we are in a region of bounded
$|f'''(\xi_k)| / f''(x_k) \leq c$, we have a quadratically decreasing error $|e_{k+1}| \leq c e_k^2$.


As an aside, in standard optimization terminology this is called quadratic convergence, whereas a
condition such as $e_{k+1} \leq \alpha e_k$ with $\alpha < 1$ would be called linear convergence. Note that this analysis comes with
a number of caveats. First, we do not really have much of a guarantee when we will reach the region
of rapid convergence. Instead, we only know that once we reach it, convergence will be very
quick. Second, this requires that $f$ is well-behaved up to higher order derivatives. It comes down
to ensuring that $f$ does not have any “surprising” properties in terms of how it might change its
values.
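
The quadratic error decay can be observed directly on a small example. The sketch below runs Newton's method on $f(x) = \exp(x) - x$, a convex function of our own choosing with minimum at $x^* = 0$, so that the iterate itself is the error $e_k$. It uses `math.expm1` to evaluate $f'(x)/f''(x) = 1 - \exp(-x)$ accurately near zero.

```python
import math

def newton_errors(x0=1.0, steps=4):
    """Newton's method on f(x) = exp(x) - x, whose minimum is at x* = 0.
    Here f'(x) / f''(x) = (exp(x) - 1) / exp(x) = -expm1(-x)."""
    errors = [x0]
    x = x0
    for _ in range(steps):
        x = x + math.expm1(-x)   # x - f'(x) / f''(x)
        errors.append(x)
    return errors

errors = newton_errors()
ratios = [e_next / e ** 2 for e, e_next in zip(errors, errors[1:])]
print(errors)
print(ratios)   # bounded ratios mean e_{k+1} <= c * e_k^2
```

The number of correct digits roughly doubles in every step, and the ratios $e_{k+1}/e_k^2$ stay below $1/2$, in line with (11.3.11) since $f'''(\xi_k)/f''(x_k) = \exp(\xi_k - x_k) \leq 1$ here.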
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Preconditioning 


Quite unsurprisingly computing and storing the full Hessian is very expensive. It is thus
desirable to find alternatives. One way to improve matters is to avoid computing the Hessian in
its entirety and to compute only the diagonal entries. While this is not quite as good as the full
Newton method, it is still much better than not using it. Moreover, estimates for the main diagonal
elements are what drives some of the innovation in stochastic gradient descent optimization
algorithms. This leads to update algorithms of the form

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \mathrm{diag}(H_f)^{-1} \nabla f(\mathbf{x}). \tag{11.3.12}$$


To see why this might be a good idea consider a situation where one variable denotes height in
millimeters and the other one denotes height in kilometers. Assuming that for both the natural scale
is in meters we have a terrible mismatch in parameterizations. Using preconditioning removes
this. Effectively preconditioning with gradient descent amounts to selecting a different learning
rate for each coordinate.
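
The effect of such a mismatch can be sketched on a separable quadratic whose Hessian is diagonal, so that the diagonal preconditioner coincides with the full Hessian. The curvature values, the helper name `gd_steps`, and the step counts below are illustrative assumptions.

```python
def gd_steps(x, curvatures, eta, steps, precondition=False):
    """Gradient descent on f(x) = 0.5 * sum_i h_i * x_i^2, whose Hessian is
    diag(h). With precondition=True each gradient entry is divided by h_i."""
    x = list(x)
    for _ in range(steps):
        grad = [h * xi for h, xi in zip(curvatures, x)]
        if precondition:
            grad = [g / h for g, h in zip(grad, curvatures)]
        x = [xi - eta * g for xi, g in zip(x, grad)]
    return x

h = [1e-3, 1e3]          # badly mismatched scales
x0 = [1.0, 1.0]
plain = gd_steps(x0, h, eta=1e-3, steps=100)
precond = gd_steps(x0, h, eta=1.0, steps=1, precondition=True)
print(plain)     # first coordinate has barely moved after 100 steps
print(precond)   # both coordinates reach the minimum in a single step
```

Without preconditioning the learning rate must stay below $2/10^3$ to avoid divergence in the steep coordinate, which makes progress in the flat coordinate extremely slow.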


Gradient Descent with Line Search 


One of the key problems in gradient descent was that we might overshoot the goal or make
insufficient progress. A simple fix for the problem is to use line search in conjunction with gradient
descent. That is, we use the direction given by $\nabla f(\mathbf{x})$ and then perform binary search as to which
step length $\eta$ minimizes $f(\mathbf{x} - \eta \nabla f(\mathbf{x}))$.


This algorithm converges rapidly (for an analysis and proof see e.g., (Boyd & Vandenberghe, 
2004)). However, for the purpose of deep learning this is not quite so feasible, since each step 
of the line search would require us to evaluate the objective function on the entire dataset. This is 
way too costly to accomplish. 
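
For small problems, gradient descent with line search can be sketched in a few lines. Note that the code below swaps in a derivative-free ternary search over the step length in place of the binary search described above; the search interval $[0, 1]$, the iteration counts, and the helper name `line_search` are our own assumptions.

```python
def line_search(phi, lo=0.0, hi=1.0, iters=60):
    """Ternary search for the minimizer of a convex function phi on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if phi(m1) <= phi(m2):
            hi = m2          # the minimizer cannot lie in (m2, hi]
        else:
            lo = m1          # the minimizer cannot lie in [lo, m1)
    return (lo + hi) / 2

f = lambda x1, x2: x1 ** 2 + 2 * x2 ** 2
gradf = lambda x1, x2: (2 * x1, 4 * x2)

x = (-5.0, -2.0)
for _ in range(10):
    g = gradf(*x)
    # Step length that minimizes f along the negative gradient direction.
    eta = line_search(lambda t: f(x[0] - t * g[0], x[1] - t * g[1]))
    x = (x[0] - eta * g[0], x[1] - eta * g[1])
print(x, f(*x))
```

Each outer step evaluates $f$ over a hundred times. This is harmless for a toy function, and it is precisely this cost that becomes prohibitive when every evaluation requires a pass over the entire dataset.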


Summary 


• Learning rates matter. Too large and we diverge, too small and we do not make progress.
• Gradient descent can get stuck in local minima.
• In high dimensions adjusting the learning rate is complicated.
• Preconditioning can help with scale adjustment.
• Newton's method is a lot faster once it has started working properly in convex problems.
• Beware of using Newton's method without any adjustments for nonconvex problems.


Exercises 


1. Experiment with different learning rates and objective functions for gradient descent.
2. Implement line search to minimize a convex function in the interval $[a, b]$.
    • Do you need derivatives for binary search, i.e., to decide whether to pick $[a, (a+b)/2]$
      or $[(a+b)/2, b]$?
    • How rapid is the rate of convergence for the algorithm?
    • Implement the algorithm and apply it to minimizing $\log (\exp(x) + \exp(-2x - 3))$.
3. Design an objective function defined on $\mathbb{R}^2$ where gradient descent is exceedingly slow. Hint:
   scale different coordinates differently.
4. Implement the lightweight version of Newton's method using preconditioning:
    • Use diagonal Hessian as preconditioner.
    • Use the absolute values of that rather than the actual (possibly signed) values.
    • Apply this to the problem above.
5. Apply the algorithm above to a number of objective functions (convex or not). What happens
   if you rotate coordinates by 45 degrees?


Discussions (footnote 131)


11.4 Stochastic Gradient Descent 


In this section, we are going to introduce the basic principles of stochastic gradient descent. 


%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np, npx
npx.set_np()


11.4.1 Stochastic Gradient Updates 


In deep learning, the objective function is usually the average of the loss functions for each
example in the training dataset. Given a training dataset of $n$ examples, let $f_i(\mathbf{x})$ be the loss function
for the training example with index $i$, where $\mathbf{x}$ is the parameter vector. Then we have the objective function

$$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}). \tag{11.4.1}$$

The gradient of the objective function at $\mathbf{x}$ is computed as

$$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}). \tag{11.4.2}$$


If gradient descent is used, the computing cost for each independent variable iteration is $O(n)$,
which grows linearly with $n$. Therefore, when the model training dataset is large, the cost of
gradient descent for each iteration will be very high.


Stochastic gradient descent (SGD) reduces computational cost at each iteration. At each iteration
of stochastic gradient descent, we uniformly sample an index $i \in \{1, \ldots, n\}$ for data examples at
random, and compute the gradient $\nabla f_i(\mathbf{x})$ to update $\mathbf{x}$:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}). \tag{11.4.3}$$





131 https://discuss.d2l.ai/t/351

Here, $\eta$ is the learning rate. We can see that the computing cost for each iteration drops from
$O(n)$ of the gradient descent to the constant $O(1)$. We should mention that the stochastic gradient
$\nabla f_i(\mathbf{x})$ is an unbiased estimate of the gradient $\nabla f(\mathbf{x})$:

$$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}). \tag{11.4.4}$$


This means that, on average, the stochastic gradient is a good estimate of the gradient. 
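
Unbiasedness can be verified on a toy dataset. In the sketch below the per-example loss $f_i(x) = \frac{1}{2}(x - a_i)^2$ and the four data values are made up for illustration; the code compares the exact expectation over a uniformly drawn index with the full gradient and with a Monte Carlo average.

```python
import random

data = [1.0, 2.0, 4.0, 9.0]              # toy dataset a_1, ..., a_n
grad_i = lambda x, a: x - a               # gradient of f_i(x) = 0.5 * (x - a_i)^2
full_grad = lambda x: sum(grad_i(x, a) for a in data) / len(data)

x = 3.0
# Exact expectation over a uniformly drawn index i.
expectation = sum(grad_i(x, a) / len(data) for a in data)
print(expectation, full_grad(x))

# Monte Carlo estimate from many single-example gradients.
random.seed(0)
samples = [grad_i(x, random.choice(data)) for _ in range(100000)]
monte_carlo = sum(samples) / len(samples)
print(monte_carlo)
```

Individual samples range from $-6$ to $2$ here even though their mean is $-1$. Being unbiased says nothing about the variance of a single stochastic gradient.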


Now, we will compare it to gradient descent by adding random noise with a mean of 0 and a
variance of 1 to the gradient to simulate SGD.


f = lambda x1, x2: x1 ** 2 + 2 * x2 ** 2  # Objective
gradf = lambda x1, x2: (2 * x1, 4 * x2)  # Gradient


def sgd(x1, x2, s1, s2):
    global lr  # Learning rate scheduler
    (g1, g2) = gradf(x1, x2)
    # Simulate noisy gradient
    g1 += np.random.normal(0.0, 1, (1,))
    g2 += np.random.normal(0.0, 1, (1,))
    eta_t = eta * lr()  # Learning rate at time t
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)  # Update variables


eta = 0.1
lr = (lambda: 1)  # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50))


As we can see, the trajectory of the variables in the SGD is much more noisy than the one we
observed in gradient descent in the previous section. This is due to the stochastic nature of the
gradient. That is, even when we arrive near the minimum, we are still subject to the uncertainty
injected by the instantaneous gradient via $\eta \nabla f_i(\mathbf{x})$. Even after 50 steps the quality is still not so
good. Even worse, it will not improve after additional steps (we encourage the reader to
experiment with a larger number of steps to confirm this on their own). This leaves us with the only
alternative: change the learning rate $\eta$. However, if we pick this too small, we will not make any
meaningful progress initially. On the other hand, if we pick it too large, we will not get a good
solution, as seen above. The only way to resolve these conflicting goals is to reduce the learning
rate dynamically as optimization progresses.







This is also the reason for adding a learning rate function lr into the sgd step function. In the
example above any functionality for learning rate scheduling lies dormant as we set the associated
lr function to be constant, i.e., lr = (lambda: 1).


11.4.2 Dynamic Learning Rate 


Replacing $\eta$ with a time-dependent learning rate $\eta(t)$ adds to the complexity of controlling
convergence of an optimization algorithm. In particular, we need to figure out how rapidly $\eta$ should decay.
If it is too quick, we will stop optimizing prematurely. If we decrease it too slowly, we waste too
much time on optimization. There are a few basic strategies that are used in adjusting $\eta$ over time
(we will discuss more advanced strategies in a later chapter):

$$\begin{aligned}
\eta(t) & = \eta_i \text{ if } t_i \leq t \leq t_{i+1} && \text{piecewise constant} \\
\eta(t) & = \eta_0 \cdot e^{-\lambda t} && \text{exponential} \\
\eta(t) & = \eta_0 \cdot (\beta t + 1)^{-\alpha} && \text{polynomial}
\end{aligned} \tag{11.4.5}$$


In the first scenario we decrease the learning rate, e.g., whenever progress in optimization has
stalled. This is a common strategy for training deep networks. Alternatively we could decrease it
much more aggressively by an exponential decay. Unfortunately this leads to premature stopping
before the algorithm has converged. A popular choice is polynomial decay with $\alpha = 0.5$. In the
case of convex optimization there are a number of proofs which show that this rate is well behaved.
Let us see what this looks like in practice.


def exponential():
    global ctr
    ctr += 1
    return math.exp(-0.1 * ctr)


ctr = 1
lr = exponential  # Set up learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000))


As expected, the variance in the parameters is significantly reduced. However, this comes at the
expense of failing to converge to the optimal solution $\mathbf{x} = (0, 0)$. Even after 1000 steps we are
still very far away from the optimal solution. Indeed, the algorithm fails to converge at all. On the
other hand, if we use a polynomial decay where the learning rate decays with the inverse square
root of the number of steps, convergence is good.
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def polynomial(): 
    global ctr
    ctr += 1
    return (1 + 0.1 * ctr)**(-0.5)


ctr = 1
lr = polynomial  # Set up learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50))
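
For reference, the three schedules in (11.4.5) can also be written as pure functions of the step $t$ instead of relying on a global counter. This is a sketch under our own assumptions: the function names, the default constants, and the half-open intervals used for the piecewise constant schedule are illustrative choices.

```python
import math

def piecewise_constant(t, boundaries, rates):
    """rates[i] applies while boundaries[i] <= t < boundaries[i + 1]."""
    eta = rates[0]
    for start, rate in zip(boundaries, rates):
        if t >= start:
            eta = rate
    return eta

def exponential_decay(t, eta0=1.0, lam=0.1):
    """eta_0 * exp(-lambda * t)."""
    return eta0 * math.exp(-lam * t)

def polynomial_decay(t, eta0=1.0, beta=0.1, alpha=0.5):
    """eta_0 * (beta * t + 1)^(-alpha)."""
    return eta0 * (beta * t + 1) ** (-alpha)

for t in [0, 10, 100, 1000]:
    print(t,
          piecewise_constant(t, boundaries=[0, 50, 500], rates=[1.0, 0.1, 0.01]),
          exponential_decay(t),
          polynomial_decay(t))
```

One way to read the failure of the exponential schedule is that its step sizes sum to a finite value. If the gradients are bounded, the iterates can then only travel a bounded total distance, which may be too short to reach the minimum. The polynomial schedule with $\alpha = 0.5$ has a divergent sum and does not suffer from this.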
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There exist many more choices for how to set the learning rate. For instance, we could start with a 
small rate, then rapidly ramp up and then decrease it again, albeit more slowly. We could even
alternate between smaller and larger learning rates. There exists a large variety of such schedules.
For now let us focus on learning rate schedules for which a comprehensive theoretical analysis
is possible, i.e., on learning rates in a convex setting. For general nonconvex problems it is very
difficult to obtain meaningful convergence guarantees, since in general minimizing nonlinear
nonconvex problems is NP hard. For a survey see e.g., the excellent lecture notes (footnote 132) of
Tibshirani 2015.


11.4.3 Convergence Analysis for Convex Objectives 


The following is optional and primarily serves to convey more intuition about the problem. We
limit ourselves to one of the simplest proofs, as described by (Nesterov & Vial, 2000). Significantly
more advanced proof techniques exist, e.g., whenever the objective function is particularly well
behaved. (Hazan et al., 2008) show that for strongly convex functions, i.e., for functions that can
be bounded from below by $\mathbf{x}^\top \mathbf{Q} \mathbf{x}$, it is possible to minimize them in a small number of steps while
decreasing the learning rate like $\eta(t) = \eta_0/(\beta t + 1)$. Unfortunately this case never really occurs in
deep learning and we are left with a much more slowly decreasing rate in practice.


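Consider the case where

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w}). \tag{11.4.6}$$

In particular, assume that $\mathbf{x}_t$ is drawn from some distribution $P(\mathbf{x})$ and that $l(\mathbf{x}, \mathbf{w})$ is a convex
function in $\mathbf{w}$ for all $\mathbf{x}$. Last denote by

$$R(\mathbf{w}) = E_{\mathbf{x} \sim P}[l(\mathbf{x}, \mathbf{w})] \tag{11.4.7}$$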





132 https://www.stat.cmu.edu/~ryantibs/convexopt-F15/lectures/26-nonconvex.pdf
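the expected risk and by $R^*$ its minimum with regard to $\mathbf{w}$. Last let $\mathbf{w}^*$ be the minimizer (we
assume that it exists within the domain on which $\mathbf{w}$ is defined). In this case we can track the distance
between the current parameter $\mathbf{w}_t$ and the risk minimizer $\mathbf{w}^*$ and see whether it improves over
time:

$$\begin{aligned}
\|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2 &= \|\mathbf{w}_t - \eta_t \partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w}) - \mathbf{w}^*\|^2 \\
&= \|\mathbf{w}_t - \mathbf{w}^*\|^2 + \eta_t^2 \|\partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w})\|^2 - 2 \eta_t \left\langle \mathbf{w}_t - \mathbf{w}^*, \partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w}) \right\rangle.
\end{aligned} \tag{11.4.8}$$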


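The norm of the gradient $\partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w})$ can be bounded from above by some Lipschitz constant $L$,
hence we have that

$$\eta_t^2 \|\partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w})\|^2 \leq \eta_t^2 L^2. \tag{11.4.9}$$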


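We are mostly interested in how the distance between $\mathbf{w}_t$ and $\mathbf{w}^*$ changes in expectation. In fact,
for any specific sequence of steps the distance might well increase, depending on whichever $\mathbf{x}_t$
we encounter. Hence we need to bound the inner product. By convexity we have that

$$l(\mathbf{x}_t, \mathbf{w}^*) \geq l(\mathbf{x}_t, \mathbf{w}_t) + \left\langle \mathbf{w}^* - \mathbf{w}_t, \partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w}_t) \right\rangle. \tag{11.4.10}$$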


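Using both inequalities and plugging them into the above we obtain a bound on the distance between
parameters at time $t + 1$ as follows:

$$\|\mathbf{w}_t - \mathbf{w}^*\|^2 - \|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2 \geq 2 \eta_t (l(\mathbf{x}_t, \mathbf{w}_t) - l(\mathbf{x}_t, \mathbf{w}^*)) - \eta_t^2 L^2. \tag{11.4.11}$$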


This means that we make progress as long as the expected difference between current loss and
the optimal loss outweighs $\eta_t L^2/2$. Since the former is bound to converge to 0 it follows that the
learning rate $\eta_t$ also needs to vanish.


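Next we take expectations over this expression. This yields

$$E_{\mathbf{w}_t}\left[\|\mathbf{w}_t - \mathbf{w}^*\|^2\right] - E_{\mathbf{w}_{t+1} \mid \mathbf{w}_t}\left[\|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2\right] \geq 2 \eta_t [E[R(\mathbf{w}_t)] - R^*] - \eta_t^2 L^2. \tag{11.4.12}$$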


The last step involves summing over the inequalities for $t \in \{1, \ldots, T\}$. Since the sum telescopes
and by dropping the lower term we obtain

$$\|\mathbf{w}_0 - \mathbf{w}^*\|^2 \geq 2 \sum_{t=1}^T \eta_t [E[R(\mathbf{w}_t)] - R^*] - L^2 \sum_{t=1}^T \eta_t^2. \tag{11.4.13}$$

Note that we exploited that $\mathbf{w}_0$ is given and thus the expectation can be dropped. Last define

$$\bar{\mathbf{w}} := \frac{\sum_{t=1}^T \eta_t \mathbf{w}_t}{\sum_{t=1}^T \eta_t}. \tag{11.4.14}$$
Then by convexity it follows that

$$\sum_t \eta_t E[R(\mathbf{w}_t)] \geq \sum_t \eta_t \cdot E[R(\bar{\mathbf{w}})]. \tag{11.4.15}$$


Plugging this into the above inequality yields the bound

$$E[R(\bar{\mathbf{w}})] - R^* \leq \frac{r^2 + L^2 \sum_{t=1}^T \eta_t^2}{2 \sum_{t=1}^T \eta_t}. \tag{11.4.16}$$

Here $r^2 := \|\mathbf{w}_0 - \mathbf{w}^*\|^2$ is a bound on the distance between the initial choice of parameters and
the final outcome. In short, the speed of convergence depends on how rapidly the loss function
changes via the Lipschitz constant $L$ and how far away from optimality the initial value is via $r$. Note
that the bound is in terms of $\bar{\mathbf{w}}$ rather than $\mathbf{w}_T$. This is the case since $\bar{\mathbf{w}}$ is a smoothed version of
the optimization path. Now let us analyze some choices for $\eta_t$.










* Known Time Horizon. Whenever $r$, $L$ and $T$ are known we can pick $\eta = r/(L\sqrt{T})$. This yields
  the upper bound $rL/\sqrt{T}$. That is, we converge with rate $O(1/\sqrt{T})$ to the
  optimal solution.


* Unknown Time Horizon. Whenever we want to have a good solution for any time $T$ we can
  pick $\eta_t = O(1/\sqrt{t})$. This costs us an extra logarithmic factor and it leads to an upper bound
  of the form $O(\log T/\sqrt{T})$.


Note that for strongly convex losses $l(\mathbf{x}, \mathbf{w}') \geq l(\mathbf{x}, \mathbf{w}) + \langle \mathbf{w}' - \mathbf{w}, \partial_\mathbf{w} l(\mathbf{x}, \mathbf{w}) \rangle + \frac{\lambda}{2} \|\mathbf{w} - \mathbf{w}'\|^2$ we can
design even more rapidly converging optimization schedules. In fact, an exponential decay in $\eta$
leads to a bound of the form $O(\log T/T)$.
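
To get a feeling for these rates, the right-hand side of the bound (11.4.16) can simply be evaluated for a given schedule. The following is a minimal sketch, not part of the book's code: `sgd_bound`, `constant_schedule` and `inverse_sqrt_schedule` are hypothetical helpers, the constant in front of $1/\sqrt{t}$ is one arbitrary choice, and the values of `r` and `L` are purely illustrative.

```python
import math

def sgd_bound(etas, r, L):
    # Right-hand side of (11.4.16): (r^2 + L^2 * sum eta_t^2) / (2 * sum eta_t)
    return (r ** 2 + L ** 2 * sum(e ** 2 for e in etas)) / (2 * sum(etas))

def constant_schedule(T, r, L):
    # Known time horizon: eta = r / (L * sqrt(T)) at every step
    return [r / (L * math.sqrt(T))] * T

def inverse_sqrt_schedule(T, r, L):
    # Unknown time horizon: eta_t = r / (L * sqrt(t)), which never uses T
    return [r / (L * math.sqrt(t)) for t in range(1, T + 1)]

r, L = 1.0, 1.0  # illustrative values only
for T in (10, 100, 1000, 10000):
    known = sgd_bound(constant_schedule(T, r, L), r, L)
    unknown = sgd_bound(inverse_sqrt_schedule(T, r, L), r, L)
    print(f'T={T:6d}  known horizon: {known:.4f}  '
          f'unknown horizon: {unknown:.4f}  rL/sqrt(T): {r * L / math.sqrt(T):.4f}')
```

The printed values can be compared with $rL/\sqrt{T}$: the schedule that does not know $T$ pays a slowly growing (logarithmic) penalty relative to the one that does.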


11.4.4 Stochastic Gradients and Finite Samples 


So far we have played a bit fast and loose when it comes to talking about stochastic gradient
descent. We posited that we draw instances $x_i$, typically with labels $y_i$ from some distribution $p(x, y)$
and that we use this to update the weights $w$ in some manner. In particular, for a finite sample size
we simply argued that the discrete distribution $p(x, y) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}(x) \delta_{y_i}(y)$ allows us to perform
SGD over it.


However, this is not really what we did. In the toy examples in the current section we simply
added noise to an otherwise non-stochastic gradient, i.e., we pretended to have pairs $(x_i, y_i)$. It
turns out that this is justified here (see the exercises for a detailed discussion). More troubling is
that in all previous discussions we clearly did not do this. Instead we iterated over all instances
exactly once. To see why this is preferable consider the converse, namely that we are sampling
$n$ observations from the discrete distribution with replacement. The probability of choosing an
element $i$ at random is $1/n$. Thus the probability of choosing it at least once is


$$P(\mathrm{choose}\ i) = 1 - P(\mathrm{omit}\ i) = 1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 0.63. \tag{11.4.17}$$


A similar reasoning shows that the probability of picking a sample exactly once is given by
$\binom{n}{1} \frac{1}{n} (1 - 1/n)^{n-1} = \frac{n}{n-1} (1 - 1/n)^n \approx e^{-1} \approx 0.37$. This leads to an increased variance
and decreased data efficiency relative to sampling without replacement. Hence, in practice we
perform the latter (and this is the default choice throughout this book). Last note that repeated
passes through the dataset traverse it in a different random order.
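
The two probabilities above are easy to check numerically. The sketch below is an illustration only; `analytic_probabilities` and `simulated_fractions` are hypothetical helper names, and `n` stands for the dataset size. It evaluates the closed-form expressions and compares them with a single simulated epoch of sampling with replacement.

```python
import math
import random

def analytic_probabilities(n):
    # Probability that a fixed example is drawn at least once / exactly once
    # when n examples are sampled with replacement from a dataset of size n
    at_least_once = 1 - (1 - 1 / n) ** n
    exactly_once = (1 - 1 / n) ** (n - 1)  # equals n * (1/n) * (1 - 1/n)^(n-1)
    return at_least_once, exactly_once

def simulated_fractions(n, seed=0):
    # Draw n indices with replacement and count how often each index occurs
    rng = random.Random(seed)
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    seen = sum(c >= 1 for c in counts) / n
    once = sum(c == 1 for c in counts) / n
    return seen, once

print(analytic_probabilities(1000))
print(simulated_fractions(10000))
print(1 - math.exp(-1), math.exp(-1))
```

Roughly a third of the examples are never visited in such an epoch, which is the loss in data efficiency that sampling without replacement avoids.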


Summary 


* For convex problems we can prove that for a wide choice of learning rates Stochastic Gradient
  Descent will converge to the optimal solution.

* For deep learning this is generally not the case. However, the analysis of convex problems
  gives us useful insight into how to approach optimization, namely to reduce the learning
  rate progressively, albeit not too quickly.

* Problems occur when the learning rate is too small or too large. In practice a suitable
  learning rate is often found only after multiple experiments.

* When there are more examples in the training dataset, it costs more to compute each
  iteration for gradient descent, so SGD is preferred in these cases.

* Optimality guarantees for SGD are in general not available in nonconvex cases since the
  number of local minima that require checking might well be exponential.
Exercises


1. Experiment with different learning rate schedules for SGD and with different numbers of
   iterations. In particular, plot the distance from the optimal solution $(0, 0)$ as a function of
   the number of iterations.


2. Prove that for the function $f(x_1, x_2) = x_1^2 + 2 x_2^2$ adding normal noise to the gradient is
   equivalent to minimizing a loss function $l(\mathbf{x}, \mathbf{w}) = (x_1 - w_1)^2 + 2 (x_2 - w_2)^2$ where $\mathbf{x}$ is drawn from
   a normal distribution.


   * Derive mean and variance of the distribution for $\mathbf{x}$.


   * Show that this property holds in general for objective functions
     $f(\mathbf{x}) = (\mathbf{x} - \boldsymbol{\mu})^\top Q (\mathbf{x} - \boldsymbol{\mu})$ for $Q \succeq 0$.


3. Compare convergence of SGD when you sample from $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ with
   replacement and when you sample without replacement.


4. How would you change the SGD solver if some gradient (or rather some coordinate
   associated with it) was consistently larger than all other gradients?


5. Assume that $f(x) = x^2 (1 + \sin x)$. How many local minima does $f$ have? Can you change $f$
   in such a way that to minimize it one needs to evaluate all local minima?


Discussions¹³³


11.5 Minibatch Stochastic Gradient Descent 


So far we encountered two extremes in the approach to gradient based learning: Section 11.3 uses 
the full dataset to compute gradients and to update parameters, one pass at a time. Conversely 
Section 11.4 processes one observation at a time to make progress. Each of them has its own draw- 
backs. Gradient Descent is not particularly data efficient whenever data is very similar. Stochastic 
Gradient Descent is not particularly computationally efficient since CPUs and GPUs cannot exploit 
the full power of vectorization. This suggests that there might be a happy medium, and in fact, 
that's what we have been using so far in the examples we discussed. 


11.5.1 Vectorization and Caches 


At the heart of the decision to use minibatches is computational efficiency. This is most easily 
understood when considering parallelization to multiple GPUs and multiple servers. In this case 
we need to send at least one image to each GPU. With 8 GPUs per server and 16 servers we already 
arrive at a minibatch size of 128. 


Things are a bit more subtle when it comes to single GPUs or even CPUs. These devices have
multiple types of memory, often multiple types of compute units and different bandwidth constraints
between them. For instance, a CPU has a small number of registers and then L1, L2 and in some
cases even L3 cache (which is shared between the different processor cores). These caches are of
increasing size and latency (and at the same time they are of decreasing bandwidth). Suffice it to
say, the processor is capable of performing many more operations than what the main memory
interface is able to provide.





133 https://discuss.d2l.ai/t/352
* A 2GHz CPU with 16 cores and AVX-512 vectorization can process up to $2 \cdot 10^9 \cdot 16 \cdot 32 \approx 10^{12}$
  bytes per second. The capability of GPUs easily exceeds this number by a factor of 100.
  On the other hand, a midrange server processor might not have much more than 100 GB/s
  bandwidth, i.e., less than one tenth of what would be required to keep the processor fed. To
  make matters worse, not all memory access is created equal: first, memory interfaces are
  typically 64 bit wide or wider (e.g., on GPUs up to 384 bit), hence reading a single byte incurs
  the cost of a much wider access.


* There is significant overhead for the first access whereas sequential access is relatively cheap
  (this is often called a burst read). There are many more things to keep in mind, such as
  caching when we have multiple sockets, chiplets and other structures. A detailed discussion
  of this is beyond the scope of this section. See e.g., this Wikipedia article¹³⁴ for a more
  in-depth discussion.


The way to alleviate these constraints is to use a hierarchy of CPU caches which are actually fast
enough to supply the processor with data. This is the driving force behind batching in deep
learning. To keep matters simple, consider matrix-matrix multiplication, say $\mathbf{A} = \mathbf{B}\mathbf{C}$. We have a
number of options for calculating $\mathbf{A}$. For instance we could try the following:


1. We could compute $\mathbf{A}_{ij} = \mathbf{B}_{i,:} \mathbf{C}_{:,j}$, i.e., we could compute it elementwise by means of dot
   products.


2. We could compute $\mathbf{A}_{:,j} = \mathbf{B} \mathbf{C}_{:,j}$, i.e., we could compute it one column at a time. Likewise we
   could compute $\mathbf{A}$ one row $\mathbf{A}_{i,:}$ at a time.


3. We could simply compute $\mathbf{A} = \mathbf{B}\mathbf{C}$.


4. We could break $\mathbf{B}$ and $\mathbf{C}$ into smaller block matrices and compute $\mathbf{A}$ one block at a time.


If we follow the first option, we will need to copy one row and one column vector into the CPU
each time we want to compute an element $\mathbf{A}_{ij}$. Even worse, due to the fact that matrix elements
are aligned sequentially we are thus required to access many disjoint locations for one of the two
vectors as we read them from memory. The second option is much more favorable. In it, we are
able to keep the column vector $\mathbf{C}_{:,j}$ in the CPU cache while we keep on traversing through $\mathbf{B}$. This
halves the memory bandwidth requirement with correspondingly faster access. Of course, option
3 is most desirable. Unfortunately, most matrices might not entirely fit into cache (this is what we
are discussing after all). However, option 4 offers a practically useful alternative: we can move
blocks of the matrix into cache and multiply them locally. Optimized libraries take care of this for
us. Let us have a look at how efficient these operations are in practice.
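
To make option 4 concrete, here is a minimal pure-Python sketch of a blocked multiplication. It is meant only to show the loop structure (which tiles are touched together); `matmul_naive` and `matmul_blocked` are hypothetical helpers, the block size of 2 is arbitrary, and plain Python lists will not exhibit the cache benefits that optimized libraries obtain from the same idea.

```python
def matmul_naive(B, C):
    # Option 1: every entry of A is a dot product of a row of B and a column of C
    n, k, m = len(B), len(C), len(C[0])
    return [[sum(B[i][p] * C[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_blocked(B, C, block=2):
    # Option 4: walk over (block x block) tiles so that each tile of B and C
    # is reused for several entries of A before moving on
    n, k, m = len(B), len(C), len(C[0])
    A = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        partial = 0.0
                        for p in range(p0, min(p0 + block, k)):
                            partial += B[i][p] * C[p][j]
                        A[i][j] += partial
    return A

B = [[1, 2, 3], [4, 5, 6]]
C = [[7, 8], [9, 10], [11, 12]]
print(matmul_naive(B, C))
print(matmul_blocked(B, C, block=2))
```

Both functions compute the same product; only the order in which memory is visited differs.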


Beyond computational efficiency, the overhead introduced by Python and by the deep learning 
framework itself is considerable. Recall that each time we execute a command the Python in- 
terpreter sends a command to the MXNet engine which needs to insert it into the computational 
graph and deal with it during scheduling. Such overhead can be quite detrimental. In short, it is 
highly advisable to use vectorization (and matrices) whenever possible. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn

npx.set_np()

timer = d2l.Timer()







134 https://en.wikipedia.org/wiki/Cache_hierarchy
A = np.zeros((256, 256))
B = np.random.normal(0, 1, (256, 256))
C = np.random.normal(0, 1, (256, 256))


Element-wise assignment simply iterates over all rows and columns of B and C respectively to 
assign the value to A. 


# Compute A = BC one element at a time
timer.start()
for i in range(256):
    for j in range(256):
        A[i, j] = np.dot(B[i, :], C[:, j])
A.wait_to_read()
timer.stop()


61.369303941726685 


A faster strategy is to perform column-wise assignment. 


# Compute A = BC one column at a time
timer.start()
for j in range(256):
    A[:, j] = np.dot(B, C[:, j])
A.wait_to_read()
timer.stop()


0.1847524642944336


Last, the most effective manner is to perform the entire operation in one block. Let us see what 
the respective speed of the operations is. 


# Compute A = BC in one go
timer.start()
A = np.dot(B, C)
A.wait_to_read()
timer.stop()

# Multiply and add count as separate operations (fused in practice)
gigaflops = [2 / i for i in timer.times]
print(f'performance in Gigaflops: element {gigaflops[0]:.3f}, '
      f'column {gigaflops[1]:.3f}, full {gigaflops[2]:.3f}')


performance in Gigaflops: element 0.033, column 10.825, full 2669.831
11.5.2 Minibatches


In the past we took it for granted that we would read minibatches of data rather than single
observations to update parameters. We now give a brief justification for it. Processing single observations
requires us to perform many single matrix-vector (or even vector-vector) multiplications, which is
quite expensive and which incurs a significant overhead on behalf of the underlying deep learning
framework. This applies both to evaluating a network when applied to data (often referred to as
inference) and when computing gradients to update parameters. That is, this applies whenever
we perform $\mathbf{w} \leftarrow \mathbf{w} - \eta_t \mathbf{g}_t$ where


$$\mathbf{g}_t = \partial_\mathbf{w} f(\mathbf{x}_t, \mathbf{w}) \tag{11.5.1}$$


We can increase the computational efficiency of this operation by applying it to a minibatch of
observations at a time. That is, we replace the gradient $\mathbf{g}_t$ over a single observation by one over a
small batch


$$\mathbf{g}_t = \partial_\mathbf{w} \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} f(\mathbf{x}_i, \mathbf{w}) \tag{11.5.2}$$


Let us see what this does to the statistical properties of $\mathbf{g}_t$: since both $\mathbf{x}_t$ and also all elements
of the minibatch $\mathcal{B}_t$ are drawn uniformly at random from the training set, the expectation of the
gradient remains unchanged. The variance, on the other hand, is reduced significantly. Since the
minibatch gradient is composed of $b := |\mathcal{B}_t|$ independent gradients which are being averaged, its
standard deviation is reduced by a factor of $b^{-\frac{1}{2}}$. This, by itself, is a good thing, since it means that
the updates are more reliably aligned with the full gradient.
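
The $b^{-\frac{1}{2}}$ scaling can be observed directly in a small simulation. The sketch below is an illustration under a simplifying assumption: each per-example "gradient" is modeled as an independent standard normal scalar, and `minibatch_gradient_std` is a hypothetical helper rather than part of the book's code.

```python
import random
import statistics

def minibatch_gradient_std(batch_size, num_batches=2000, seed=0):
    # Each per-example gradient is modeled as an independent N(0, 1) draw;
    # the minibatch gradient is the average over batch_size such draws
    rng = random.Random(seed)
    means = []
    for _ in range(num_batches):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pstdev(means)

for b in (1, 4, 16, 64, 100):
    print(f'batch size {b:3d}: empirical std {minibatch_gradient_std(b):.3f}, '
          f'b^(-1/2) = {b ** -0.5:.3f}')
```

Going from a batch of 1 to a batch of 100 costs 100 times more computation per update but shrinks the standard deviation only by a factor of 10, which is the diminishing return discussed next.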


Naively this would indicate that choosing a large minibatch $\mathcal{B}_t$ would be universally desirable.
Alas, after some point, the additional reduction in standard deviation is minimal when compared
to the linear increase in computational cost. In practice we pick a minibatch that is large enough
to offer good computational efficiency while still fitting into the memory of a GPU. To illustrate the
savings let us have a look at some code. In it we perform the same matrix-matrix multiplication,
but this time broken up into “minibatches” of 64 columns at a time.


timer.start()
for j in range(0, 256, 64):
    A[:, j:j+64] = np.dot(B, C[:, j:j+64])
timer.stop()
print(f'performance in Gigaflops: block {2 / timer.times[3]:.3f}')


performance in Gigaflops: block 624.989


As we can see, the computation on the minibatch is essentially as efficient as on the full matrix. A
word of caution is in order. In Section 7.5 we used a type of regularization that was heavily
dependent on the amount of variance in a minibatch. As we increase the minibatch size, the variance decreases
and with it the benefit of the noise-injection due to batch normalization. See e.g., (Ioffe, 2017) for
details on how to rescale and compute the appropriate terms.
11.5.3 Reading the Dataset


Let us have a look at how minibatches are efficiently generated from data. In the following we use
a dataset developed by NASA to test the wing noise from different aircraft¹³⁵ to compare these
optimization algorithms. For convenience we only use the first 1,500 examples. The data is whitened
for preprocessing, i.e., we remove the mean and rescale the variance to 1 per coordinate.


#@save
d2l.DATA_HUB['airfoil'] = (d2l.DATA_URL + 'airfoil_self_noise.dat',
                           '76e5be1548fd8222e5074cf0faae75edff8cf93f')


#@save
def get_data_ch11(batch_size=10, n=1500):
    data = np.genfromtxt(d2l.download('airfoil'),
                         dtype=np.float32, delimiter='\t')
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    data_iter = d2l.load_array(
        (data[:n, :-1], data[:n, -1]), batch_size, is_train=True)
    return data_iter, data.shape[1]-1


11.5.4 Implementation from Scratch 


Recall the minibatch SGD implementation from Section 3.2. In the following we provide a slightly 
more general implementation. For convenience it has the same call signature as the other opti- 
mization algorithms introduced later in this chapter. Specifically, we add the status input states 
and place the hyperparameter in dictionary hyperparams. In addition, we will average the loss of 
each minibatch example in the training function, so the gradient in the optimization algorithm 
does not need to be divided by the batch size. 


def sgd(params, states, hyperparams):
    for p in params:
        p[:] -= hyperparams['lr'] * p.grad


Next, we implement a generic training function to facilitate the use of the other optimization al- 
gorithms introduced later in this chapter. It initializes a linear regression model and can be used 
to train the model with minibatch SGD and other algorithms introduced subsequently. 


#@save
def train_ch11(trainer_fn, states, hyperparams, data_iter,
               feature_dim, num_epochs=2):
    # Initialization
    w = np.random.normal(scale=0.01, size=(feature_dim, 1))
    b = np.zeros(1)
    w.attach_grad()
    b.attach_grad()
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    # Train
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs], ylim=[0.22, 0.35])
    n, timer = 0, d2l.Timer()







135 https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise 
    for _ in range(num_epochs):
        for X, y in data_iter:
            with autograd.record():
                l = loss(net(X), y).mean()
            l.backward()
            trainer_fn([w, b], states, hyperparams)
            n += X.shape[0]
            if n % 200 == 0:
                timer.stop()
                animator.add(n/X.shape[0]/len(data_iter),
                             (d2l.evaluate_loss(net, data_iter, loss),))
                timer.start()
    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.avg():.3f} sec/epoch')
    return timer.cumsum(), animator.Y[0]


Let us see how optimization proceeds for batch gradient descent. This can be achieved by setting 
the minibatch size to 1500 (i.e., to the total number of examples). As a result the model parameters 
are updated only once per epoch. There is little progress. In fact, after 6 steps progress stalls. 


def train_sgd(lr, batch_size, num_epochs=2):
    data_iter, feature_dim = get_data_ch11(batch_size)
    return train_ch11(
        sgd, None, {'lr': lr}, data_iter, feature_dim, num_epochs)


gd_res = train_sgd(1, 1500, 10)


loss: 0.254, 0.074 sec/epoch
When the batch size equals 1, we use SGD for optimization. For simplicity of implementation we
picked a constant (albeit small) learning rate. In SGD, the model parameters are updated
whenever an example is processed. In our case this amounts to 1500 updates per epoch. As we can see,
the decline in the value of the objective function slows down after one epoch. Although both
procedures processed 1500 examples within one epoch, SGD consumes more time than gradient
descent in our experiment. This is because SGD updated the parameters more frequently and
because it is less efficient to process single observations one at a time.
sgd_res = train_sgd(0.005, 1)


loss: 0.243, 0.432 sec/epoch 
Finally, when the batch size equals 100, we use minibatch SGD for optimization. The time required
per epoch is shorter than the time needed for SGD and the time for batch gradient descent.


mini1_res = train_sgd(.4, 100)


loss: 0.244, 0.009 sec/epoch 
Reducing the batch size to 10, the time for each epoch increases because the workload for each
batch is less efficient to execute.


mini2_res = train_sgd(.05, 10) 


loss: 0.247, 0.051 sec/epoch 
Now we can compare the time vs. loss for the previous four experiments. As can be seen,
although SGD converges faster than GD in terms of number of examples processed, it uses more
time to reach the same loss than GD because computing the gradient example by example is not
as efficient. Minibatch SGD is able to trade off convergence speed and computation efficiency. A
minibatch size of 10 is more efficient than SGD; a minibatch size of 100 even outperforms GD in
terms of runtime.


d2l.set_figsize([6, 3])
d2l.plot(*list(map(list, zip(gd_res, sgd_res, mini1_res, mini2_res))),
         'time (sec)', 'loss', xlim=[1e-2, 10],
         legend=['gd', 'sgd', 'batch size=100', 'batch size=10'])
d2l.plt.gca().set_xscale('log')


[Figure: loss versus time (sec, log scale) for gd, sgd, batch size=100, and batch size=10.]
11.5.5 Concise Implementation


In Gluon, we can use the Trainer class to call optimization algorithms. This is used to implement 
a generic training function. We will use this throughout the current chapter. 


#@save
def train_concise_ch11(tr_name, hyperparams, data_iter, num_epochs=2):
    # Initialization
    net = nn.Sequential()
    net.add(nn.Dense(1))
    net.initialize(init.Normal(sigma=0.01))
    trainer = gluon.Trainer(net.collect_params(), tr_name, hyperparams)
    loss = gluon.loss.L2Loss()
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs], ylim=[0.22, 0.35])
    n, timer = 0, d2l.Timer()
    for _ in range(num_epochs):
        for X, y in data_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(X.shape[0])
            n += X.shape[0]
            if n % 200 == 0:
                timer.stop()
                animator.add(n/X.shape[0]/len(data_iter),
                             (d2l.evaluate_loss(net, data_iter, loss),))
                timer.start()
    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.avg():.3f} sec/epoch')


Using Gluon to repeat the last experiment shows identical behavior. 


data_iter, _ = get_data_ch11(10)
train_concise_ch11('sgd', {'learning_rate': 0.05}, data_iter)





loss: 0.243, 0.050 sec/epoch 
Summary


* Vectorization makes code more efficient due to reduced overhead arising from the deep
  learning framework and due to better memory locality and caching on CPUs and GPUs.

* There is a trade-off between statistical efficiency arising from SGD and computational
  efficiency arising from processing large batches of data at a time.

* Minibatch stochastic gradient descent offers the best of both worlds: computational and
  statistical efficiency.

* In minibatch SGD we process batches of data obtained by a random permutation of the
  training data (i.e., each observation is processed only once per epoch, albeit in random order).

* It is advisable to decay the learning rates during training.

* In general, minibatch SGD is faster than SGD and gradient descent for convergence to a
  smaller risk, when measured in terms of clock time.


Exercises 


1. Modify the batch size and learning rate and observe the rate of decline for the value of the
   objective function and the time consumed in each epoch.


2. Read the MXNet documentation and use the Trainer class set_learning_rate function to
   reduce the learning rate of the minibatch SGD to 1/10 of its previous value after each epoch.


3. Compare minibatch SGD with a variant that actually samples with replacement from the
   training set. What happens?


4. An evil genie replicates your dataset without telling you (i.e., each observation occurs twice
   and your dataset grows to twice its original size, but nobody told you). How does the behavior
   of SGD, minibatch SGD and that of gradient descent change?


Discussions¹³⁶


11.6 Momentum 


In Section 11.4 we reviewed what happens when performing stochastic gradient descent, i.e., 
when performing optimization where only a noisy variant of the gradient is available. In partic- 
ular, we noticed that for noisy gradients we need to be extra cautious when it comes to choosing 
the learning rate in the face of noise. If we decrease it too rapidly, convergence stalls. If we are 
too lenient, we fail to converge to a good enough solution since noise keeps on driving us away 
from optimality. 





136 https://discuss.d2l.ai/t/353


11.6.1 Basics 


In this section, we will explore more effective optimization algorithms, especially for certain types 
of optimization problems that are common in practice. 


Leaky Averages 


The previous section saw us discussing minibatch SGD as a means for accelerating computation.
It also had the nice side-effect that averaging gradients reduced the amount of variance. The
minibatch stochastic gradient can be calculated by:


$$\mathbf{g}_{t, t-1} = \partial_\mathbf{w} \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} f(\mathbf{x}_i, \mathbf{w}_{t-1}) = \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \mathbf{h}_{i, t-1}. \tag{11.6.1}$$


To keep the notation simple, here we used $\mathbf{h}_{i, t-1} = \partial_\mathbf{w} f(\mathbf{x}_i, \mathbf{w}_{t-1})$ as the stochastic gradient for sample $i$ using
the weights updated at time $t - 1$. It would be nice if we could benefit from the effect of variance
reduction even beyond averaging gradients on a minibatch. One option to accomplish this task is
to replace the gradient computation by a “leaky average”:


$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1} \tag{11.6.2}$$


for some $\beta \in (0, 1)$. This effectively replaces the instantaneous gradient by one that has been
averaged over multiple past gradients. $\mathbf{v}$ is called momentum. It accumulates past gradients similar to
how a heavy ball rolling down the objective function landscape integrates over past forces. To see
what is happening in more detail let us expand $\mathbf{v}_t$ recursively into


$$\mathbf{v}_t = \beta^2 \mathbf{v}_{t-2} + \beta \mathbf{g}_{t-1, t-2} + \mathbf{g}_{t, t-1} = \ldots = \sum_{\tau = 0}^{t-1} \beta^{\tau} \mathbf{g}_{t-\tau, t-\tau-1}. \tag{11.6.3}$$


Large $\beta$ amounts to a long-range average, whereas small $\beta$ amounts to only a slight correction
relative to a gradient method. The new gradient replacement no longer points into the direction
of steepest descent on a particular instance but rather in the direction of a weighted
average of past gradients. This allows us to realize most of the benefits of averaging over a batch
without the cost of actually computing the gradients on it. We will revisit this averaging procedure
in more detail later.
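
The equivalence between the recursive definition and the expanded weighted sum can be checked numerically. The following sketch uses scalar stand-ins for the gradients; `leaky_average` and `weighted_sum` are hypothetical helper names, and the velocity is assumed to start at zero.

```python
def leaky_average(gradients, beta):
    # Recursive form: v_t = beta * v_{t-1} + g_t, starting from v_0 = 0
    v = 0.0
    for g in gradients:
        v = beta * v + g
    return v

def weighted_sum(gradients, beta):
    # Expanded form: the gradient from tau steps ago is weighted by beta^tau
    t = len(gradients)
    return sum(beta ** tau * gradients[t - 1 - tau] for tau in range(t))

grads = [1.0, -2.0, 0.5, 3.0]  # arbitrary scalar "gradients" for illustration
print(leaky_average(grads, beta=0.9), weighted_sum(grads, beta=0.9))
```

Both calls return the same number, with the most recent gradient entering at full weight and older ones discounted geometrically.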


The above reasoning formed the basis for what is now known as accelerated gradient methods, 
such as gradients with momentum. They enjoy the additional benefit of being much more effec- 
tive in cases where the optimization problem is ill-conditioned (i.e., where there are some direc- 
tions where progress is much slower than in others, resembling a narrow canyon). Furthermore, 
they allow us to average over subsequent gradients to obtain more stable directions of descent. 
Indeed, the aspect of acceleration even for noise-free convex problems is one of the key reasons 
why momentum works and why it works so well. 


As one would expect, due to its efficacy momentum is a well-studied subject in optimization for
deep learning and beyond. See e.g., the beautiful expository article¹³⁷ by (Goh, 2017) for an
in-depth analysis and interactive animation. It was proposed by (Polyak, 1964). (Nesterov, 2018)
has a detailed theoretical discussion in the context of convex optimization. Momentum in deep
learning has been known to be beneficial for a long time. See e.g., the discussion by (Sutskever et
al., 2013) for details.





137 https://distill.pub/2017/momentum/ 
An Ill-conditioned Problem


To get a better understanding of the geometric properties of the momentum method we revisit
gradient descent, albeit with a significantly less pleasant objective function. Recall that in Section
11.3 we used $f(\mathbf{x}) = x_1^2 + 2 x_2^2$, i.e., a moderately distorted ellipsoid objective. We distort this
function further by stretching it out in the $x_1$ direction via


$$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2. \tag{11.6.4}$$


As before $f$ has its minimum at $(0, 0)$. This function is very flat in the direction of $x_1$. Let us
see what happens when we perform gradient descent as before on this new function. We pick a
learning rate of 0.4.


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()

eta = 0.4
def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2
def gd_2d(x1, x2, s1, s2):
    return (x1 - eta * 0.2 * x1, x2 - eta * 4 * x2, 0, 0)

d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))





By construction, the gradient in the $x_2$ direction is much higher and changes much more rapidly
than in the horizontal $x_1$ direction. Thus we are stuck between two undesirable choices: if we
pick a small learning rate we ensure that the solution does not diverge in the $x_2$ direction but we
are saddled with slow convergence in the $x_1$ direction. Conversely, with a large learning rate we
progress rapidly in the $x_1$ direction but diverge in $x_2$. The example below illustrates what
happens even after a slight increase in learning rate from 0.4 to 0.6. Convergence in the $x_1$ direction
improves but the overall solution quality is much worse.


eta = 0.6
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
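
The behavior at the two learning rates can also be worked out by hand. For this quadratic, a gradient descent step multiplies $x_1$ by $(1 - 0.2\eta)$ and $x_2$ by $(1 - 4\eta)$, so a coordinate shrinks only if the absolute value of its factor is below 1. The short sketch below tabulates these factors; `gd_factors` is a hypothetical helper, not part of `d2l`.

```python
def gd_factors(eta):
    # Per-step multipliers of gradient descent on f = 0.1 * x1^2 + 2 * x2^2:
    # x1 <- (1 - 0.2 * eta) * x1 and x2 <- (1 - 4 * eta) * x2
    return abs(1 - 0.2 * eta), abs(1 - 4 * eta)

for eta in (0.1, 0.4, 0.6):
    f1, f2 = gd_factors(eta)
    print(f'eta={eta}: |factor x1|={f1:.2f}, |factor x2|={f2:.2f}, '
          f'stable={max(f1, f2) < 1}')
```

With a learning rate of 0.4 both factors are below 1, but $x_1$ shrinks by only 8% per step. With 0.6 the $x_2$ factor has magnitude 1.4, so that coordinate grows at every step, while $x_1$ still shrinks by just 12% per step.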
The Momentum Method


The momentum method allows us to solve the gradient descent problem described above.
Looking at the optimization trace above we might intuit that averaging gradients over the past would
work well. After all, in the $x_1$ direction this will aggregate well-aligned gradients, thus increasing
the distance we cover with every step. Conversely, in the $x_2$ direction where gradients oscillate,
an aggregate gradient will reduce step size due to oscillations that cancel each other out. Using $\mathbf{v}_t$
instead of the gradient $\mathbf{g}_t$ yields the following update equations:


$$\begin{aligned}
\mathbf{v}_t &\leftarrow \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}, \\
\mathbf{x}_t &\leftarrow \mathbf{x}_{t-1} - \eta_t \mathbf{v}_t.
\end{aligned} \tag{11.6.5}$$


Note that for $\beta = 0$ we recover regular gradient descent. Before delving deeper into the
mathematical properties let us have a quick look at how the algorithm behaves in practice.


def momentum_2d(x1, x2, v1, v2): 
vl = beta * vl + 0.2 * x1 
v2 = beta * v2 + 4 * x2 
return x1 - eta * vl, x2 - eta * v2, v1, v2 


eta, beta = 0.6, 0.5 
d21.show_trace_2d(f_2d, d21.train_2d(momentum_2d) ) 


As we can see, even with the same learning rate that we used before, momentum still converges
well. Let us see what happens when we decrease the momentum parameter. Halving it to
$\beta = 0.25$ leads to a trajectory that barely converges at all. Nonetheless, it is a lot better than without
momentum (when the solution diverges).


eta, beta = 0.6, 0.25
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))





Note that we can combine momentum with SGD and in particular, minibatch-SGD. The only
change is that in that case we replace the gradients $\mathbf{g}_{t, t-1}$ with $\mathbf{g}_t$. Last, for convenience we
initialize $\mathbf{v}_0 = 0$ at time $t = 0$. Let us look at what leaky averaging actually does to the updates.


Effective Sample Weight 


Recall that $\mathbf{v}_t = \sum_{\tau = 0}^{t-1} \beta^{\tau} \mathbf{g}_{t-\tau, t-\tau-1}$. In the limit the terms add up to
$\sum_{\tau=0}^\infty \beta^\tau = \frac{1}{1-\beta}$. In other
words, rather than taking a step of size $\eta$ in GD or SGD we take a step of size $\frac{\eta}{1-\beta}$ while at the same
time dealing with a potentially much better behaved descent direction. These are two benefits in
one. To illustrate how weighting behaves for different choices of $\beta$ consider the diagram below.


d2l.set_figsize()
betas = [0.95, 0.9, 0.6, 0]
for beta in betas:
    x = np.arange(40).asnumpy()
    d2l.plt.plot(x, beta ** x, label=f'beta = {beta:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend()


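The limit of the leaky sum discussed above can also be checked numerically. The sketch below is plain Python with a hypothetical helper name. It assumes a constant gradient of 1, iterates the velocity update, and compares the result with 1 / (1 - beta).

```python
def leaky_sum(beta, grad=1.0, steps=200):
    """Iterate v <- beta * v + grad with a constant gradient, starting at v = 0."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad
    return v


for beta in (0.0, 0.6, 0.9, 0.95):
    print(f'beta = {beta:.2f}: v after 200 steps = {leaky_sum(beta):.3f}, '
          f'1 / (1 - beta) = {1 / (1 - beta):.3f}')
```

For a fixed learning rate this means that a larger beta lengthens the effective step along directions where gradients agree, which is why the learning rate is usually reduced when the momentum parameter is increased.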


11.6.2 Practical Experiments 


Let us see how momentum works in practice, i.e., when used within the context of a proper opti- 
mizer. For this we need a somewhat more scalable implementation. 


Implementation from Scratch 


Compared with (minibatch) SGD the momentum method needs to maintain a set of auxiliary vari- 
ables, i.e., velocity. It has the same shape as the gradients (and variables of the optimization prob- 
lem). In the implementation below we call these variables states. 


def init_momentum_states(feature_dim):
    v_w = np.zeros((feature_dim, 1))
    v_b = np.zeros(1)
    return (v_w, v_b)


def sgd_momentum(params, states, hyperparams):
    for p, v in zip(params, states):
        v[:] = hyperparams['momentum'] * v + p.grad
        p[:] -= hyperparams['lr'] * v


Let us see how this works in practice. 


def train_momentum(lr, momentum, num_epochs=2):
    d2l.train_ch11(sgd_momentum, init_momentum_states(feature_dim),
                   {'lr': lr, 'momentum': momentum}, data_iter,
                   feature_dim, num_epochs)


data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
train_momentum(0.02, 0.5)


loss: 0.243, 0.068 sec/epoch 
When we increase the momentum hyperparameter momentum to 0.9, it amounts to a significantly
larger effective sample size of $\frac{1}{1 - 0.9} = 10$. We reduce the learning rate slightly to 0.01 to keep
matters under control.


train_momentum(0.01, 0.9) 


loss: 0.256, 0.068 sec/epoch

Reducing the learning rate further addresses any issue of non-smooth optimization problems.
Setting it to 0.005 yields good convergence properties.


train_momentum(0.005, 0.9)


loss: 0.250, 0.066 sec/epoch

Concise Implementation


There is very little to do in Gluon since the standard sgd solver already had momentum built in. 
Setting matching parameters yields a very similar trajectory. 


d2l.train_concise_ch11('sgd', {'learning_rate': 0.005, 'momentum': 0.9},
                       data_iter)


loss: 0.246, 0.049 sec/epoch

11.6.3 Theoretical Analysis


So far the 2D example of $f(x) = 0.1 x_1^2 + 2 x_2^2$ seemed rather contrived. We will now see that this is
actually quite representative of the types of problem one might encounter, at least in the case of
minimizing convex quadratic objective functions.







Quadratic Convex Functions 


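Consider the function

$$h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b. \tag{11.6.6}$$


This is a general quadratic function. For positive definite matrices $\mathbf{Q} \succ 0$, i.e., for matrices with
positive eigenvalues this has a minimizer at $\mathbf{x}^* = -\mathbf{Q}^{-1} \mathbf{c}$ with minimum value
$b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}$. Hence we can rewrite $h$ as


$$h(\mathbf{x}) = \frac{1}{2} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})^\top \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c}) + b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}. \tag{11.6.7}$$


The gradient is given by $\partial_{\mathbf{x}} h(\mathbf{x}) = \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c}) = \mathbf{Q} (\mathbf{x} - \mathbf{x}^*)$. That is, it is given by the distance between $\mathbf{x}$
and the minimizer, multiplied by $\mathbf{Q}$. Consequently also the momentum is a linear combination of
terms $\mathbf{Q} (\mathbf{x}_t + \mathbf{Q}^{-1} \mathbf{c})$.


Since $\mathbf{Q}$ is positive definite it can be decomposed into its eigensystem via
$\mathbf{Q} = \mathbf{O}^\top \boldsymbol{\Lambda} \mathbf{O}$ for an orthogonal (rotation) matrix $\mathbf{O}$ and a diagonal matrix $\boldsymbol{\Lambda}$ of positive eigenvalues. This allows us to
perform a change of variables from $\mathbf{x}$ to $\mathbf{z} := \mathbf{O} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})$ to obtain a much simplified expression:


$$h(\mathbf{z}) = \frac{1}{2} \mathbf{z}^\top \boldsymbol{\Lambda} \mathbf{z} + b'. \tag{11.6.8}$$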


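Here $b' = b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}$. Since $\mathbf{O}$ is only an orthogonal matrix this does not perturb the gradients
in a meaningful way. Expressed in terms of $\mathbf{z}$ gradient descent becomes


$$\mathbf{z}_t = \mathbf{z}_{t-1} - \boldsymbol{\Lambda} \mathbf{z}_{t-1} = (\mathbf{I} - \boldsymbol{\Lambda}) \mathbf{z}_{t-1}. \tag{11.6.9}$$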


The important fact in this expression is that gradient descent does not mix between different 
eigenspaces. That is, when expressed in terms of the eigensystem of Q the optimization prob- 
lem proceeds in a coordinate-wise manner. This also holds for momentum. 


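$$\begin{aligned}
\mathbf{v}_t &= \beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1} \\
\mathbf{z}_t &= \mathbf{z}_{t-1} - \eta \left(\beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1}\right) \\
&= (\mathbf{I} - \eta \boldsymbol{\Lambda}) \mathbf{z}_{t-1} - \eta \beta \mathbf{v}_{t-1}.
\end{aligned} \tag{11.6.10}$$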


In doing this we just proved the following theorem: Gradient Descent with and without momen- 
tum for a convex quadratic function decomposes into coordinate-wise optimization in the direc- 
tion of the eigenvectors of the quadratic matrix. 
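The decomposition stated in the theorem can be checked on a small example. The sketch below is plain Python and makes several assumptions for illustration: a 2D problem with c = 0 and b = 0, a rotation by 0.7 radians as the orthogonal matrix, eigenvalues 0.2 and 4, and hypothetical helper names. It runs momentum once in the original coordinates and once per eigen-coordinate as two independent scalar problems, then compares the results.

```python
import math


def rotation(theta):
    """2x2 orthogonal (rotation) matrix O."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def transpose(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]


def momentum_in_x(o, lams, x, eta, beta, steps):
    """Momentum on h(x) = 0.5 * x^T Q x with Q = O^T diag(lams) O."""
    v = [0.0, 0.0]
    for _ in range(steps):
        z = matvec(o, x)                                   # O x
        grad = matvec(transpose(o),
                      [lams[0] * z[0], lams[1] * z[1]])    # Q x
        v = [beta * v[0] + grad[0], beta * v[1] + grad[1]]
        x = [x[0] - eta * v[0], x[1] - eta * v[1]]
    return x


def momentum_scalar(lam, z, eta, beta, steps):
    """Momentum on the scalar problem 0.5 * lam * z^2."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + lam * z
        z = z - eta * v
    return z


o = rotation(0.7)
lams, eta, beta, steps = (0.2, 4.0), 0.3, 0.5, 15
x0 = [1.0, -2.0]
z0 = matvec(o, x0)

z_from_x = matvec(o, momentum_in_x(o, lams, x0, eta, beta, steps))
z_direct = [momentum_scalar(lam, z, eta, beta, steps)
            for lam, z in zip(lams, z0)]
print('rotated result of the coupled run:', z_from_x)
print('independent scalar runs:          ', z_direct)
```

Up to floating point error the two results agree, which illustrates that neither gradient descent nor momentum mixes the eigen-coordinates of the quadratic.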


Scalar Functions 


Given the above result let us see what happens when we minimize the function $f(x) = \frac{\lambda}{2} x^2$. For
gradient descent we have

$$x_{t+1} = x_t - \eta \lambda x_t = (1 - \eta \lambda) x_t. \tag{11.6.11}$$


Whenever $|1 - \eta \lambda| < 1$ this optimization converges at an exponential rate since after $t$ steps we
have $x_t = (1 - \eta \lambda)^t x_0$. This shows how the rate of convergence improves initially as we increase
the learning rate $\eta$ until $\eta \lambda = 1$. Beyond that the rate of convergence deteriorates again, and for $\eta \lambda > 2$ the optimization
problem diverges.


lambdas = [0.1, 1, 10, 19]
eta = 0.1
d2l.set_figsize((6, 4))
for lam in lambdas:
    t = np.arange(20).asnumpy()
    d2l.plt.plot(t, (1 - eta * lam) ** t, label=f'lambda = {lam:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend()


To analyze convergence in the case of momentum we begin by rewriting the update equations in
terms of two scalars: one for $x$ and one for the momentum $v$. This yields:


$$\begin{bmatrix} v_{t+1} \\ x_{t+1} \end{bmatrix} =
\begin{bmatrix} \beta & \lambda \\ -\eta \beta & (1 - \eta \lambda) \end{bmatrix}
\begin{bmatrix} v_{t} \\ x_{t} \end{bmatrix} =
\mathbf{R}(\beta, \eta, \lambda) \begin{bmatrix} v_{t} \\ x_{t} \end{bmatrix}. \tag{11.6.12}$$


We used $\mathbf{R}$ to denote the $2 \times 2$ matrix governing convergence behavior. After $t$ steps the initial choice
$[v_0, x_0]$ becomes $\mathbf{R}(\beta, \eta, \lambda)^t [v_0, x_0]$. Hence, it is up to the eigenvalues of $\mathbf{R}$ to determine the speed
of convergence. See the Distill post (https://distill.pub/2017/momentum/) of (Goh, 2017) for a great animation and (Flammarion &
Bach, 2015) for a detailed analysis. One can show that for $0 < \eta \lambda < 2 + 2 \beta$ momentum converges.
This is a larger range of feasible parameters when compared to $0 < \eta \lambda < 2$ for gradient descent. It
also suggests that in general large values of $\beta$ are desirable. Further details require a fair amount
of technical detail and we suggest that the interested reader consult the original publications.
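The convergence range quoted above can be probed numerically. The sketch below is plain Python. The function name and the sample values are assumptions for illustration. It computes the eigenvalues of the 2 x 2 matrix R from its trace and determinant and reports the largest magnitude (the spectral radius). The iteration contracts when this value is below 1.

```python
import cmath


def spectral_radius(beta, eta, lam):
    """Largest eigenvalue magnitude of R = [[beta, lam], [-eta*beta, 1 - eta*lam]]."""
    trace = beta + 1 - eta * lam
    det = beta * (1 - eta * lam) + eta * beta * lam   # simplifies to beta
    disc = cmath.sqrt(trace * trace - 4 * det)
    return max(abs((trace + disc) / 2), abs((trace - disc) / 2))


beta, lam = 0.5, 1.0
for eta in (0.5, 1.9, 2.5, 2.9, 3.1):
    rho = spectral_radius(beta, eta, lam)
    print(f'eta * lambda = {eta * lam:.1f}: spectral radius = {rho:.3f}, '
          f'converges = {rho < 1}')
```

With beta = 0.5 the bound 2 + 2 beta equals 3, so values of eta lambda between 2 and 3 still converge with momentum although plain gradient descent (beta = 0) would already diverge there.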
Summary


• Momentum replaces gradients with a leaky average over past gradients. This accelerates
convergence significantly.


• It is desirable for both noise-free gradient descent and (noisy) stochastic gradient descent.


• Momentum prevents stalling of the optimization process that is much more likely to occur
for stochastic gradient descent.


• The effective number of gradients is given by $\frac{1}{1-\beta}$ due to exponentiated downweighting of
past data.


• In the case of convex quadratic problems this can be analyzed explicitly in detail.


• Implementation is quite straightforward but it requires us to store an additional state vector
(momentum $\mathbf{v}$).


Exercises 


1. Use other combinations of momentum hyperparameters and learning rates and observe and 
analyze the different experimental results. 


2. Try out GD and momentum for a quadratic problem where you have multiple eigenvalues,
i.e., $f(x) = \frac{1}{2} \sum_i \lambda_i x_i^2$, e.g., $\lambda_i = 2^{-i}$. Plot how the values of $x$ decrease for the initialization
$x_i = 1$.


3. Derive minimum value and minimizer for $h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b$.


4. What changes when we perform SGD with momentum? What happens when we use minibatch
SGD with momentum? Experiment with the parameters.


Discussions: https://discuss.d2l.ai/t/354


11.7 Adagrad


Let us begin by considering learning problems with features that occur infrequently.


11.7.1 Sparse Features and Learning Rates


Imagine that we are training a language model. To get good accuracy we typically want to decrease
the learning rate as we keep on training, usually at a rate of $\mathcal{O}(t^{-\frac{1}{2}})$ or slower. Now consider a
model training on sparse features, i.e., features that occur only infrequently. This is common for
natural language, e.g., it is a lot less likely that we will see the word "preconditioning" than "learning".
However, it is also common in other areas such as computational advertising and personalized
collaborative filtering. After all, there are many things that are of interest only for a small number
of people.


Parameters associated with infrequent features only receive meaningful updates whenever these
features occur. Given a decreasing learning rate we might end up in a situation where the parameters
for common features converge rather quickly to their optimal values, whereas for infrequent
features we are still short of observing them sufficiently frequently before their optimal values can
be determined. In other words, the learning rate either decreases too slowly for frequent features
or too quickly for infrequent ones.


A possible hack to redress this issue would be to count the number of times we see a particular
feature and to use this as a clock for adjusting learning rates. That is, rather than choosing a
learning rate of the form $\eta = \frac{\eta_0}{\sqrt{t + c}}$ we could use $\eta_i = \frac{\eta_0}{\sqrt{s(i, t) + c}}$. Here $s(i, t)$ counts the number of
nonzeros for feature $i$ that we have observed up to time $t$. This is actually quite easy to implement
at no meaningful overhead. However, it fails whenever we do not quite have sparsity but rather
just data where the gradients are often very small and only rarely large. After all, it is unclear
where one would draw the line between something that qualifies as an observed feature or not.


Adagrad by (Duchi et al., 2011) addresses this by replacing the rather crude counter $s(i, t)$ by an
aggregate of the squares of previously observed gradients. In particular, it uses
$s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$ as a means to adjust the learning rate. This has two benefits: first, we no longer
need to decide just when a gradient is large enough. Second, it scales automatically with the
magnitude of the gradients. Coordinates that routinely correspond to large gradients are scaled down
significantly, whereas others with small gradients receive a much more gentle treatment. In
practice this leads to a very effective optimization procedure for computational advertising and related
problems. But this hides some of the additional benefits inherent in Adagrad that are best
understood in the context of preconditioning.


11.7.2 Preconditioning 


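Convex optimization problems are good for analyzing the characteristics of algorithms. After all,
for most nonconvex problems it is difficult to derive meaningful theoretical guarantees, but
intuition and insight often carry over. Let us look at the problem of minimizing
$f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{c}^\top \mathbf{x} + b$.


As we saw in Section 11.6, it is possible to rewrite this problem in terms of its eigendecomposition
$\mathbf{Q} = \mathbf{U}^\top \boldsymbol{\Lambda} \mathbf{U}$ to arrive at a much simplified problem where each coordinate can be solved
individually:


$$f(\mathbf{x}) = \bar{f}(\bar{\mathbf{x}}) = \frac{1}{2} \bar{\mathbf{x}}^\top \boldsymbol{\Lambda} \bar{\mathbf{x}} + \bar{\mathbf{c}}^\top \bar{\mathbf{x}} + b. \tag{11.7.1}$$


Here we used $\bar{\mathbf{x}} = \mathbf{U} \mathbf{x}$ and consequently $\bar{\mathbf{c}} = \mathbf{U} \mathbf{c}$. The modified problem has as its minimizer
$\bar{\mathbf{x}} = -\boldsymbol{\Lambda}^{-1} \bar{\mathbf{c}}$ and minimum value $-\frac{1}{2} \bar{\mathbf{c}}^\top \boldsymbol{\Lambda}^{-1} \bar{\mathbf{c}} + b$. This is much easier to compute since $\boldsymbol{\Lambda}$ is a
diagonal matrix containing the eigenvalues of $\mathbf{Q}$.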


If we perturb $\mathbf{c}$ slightly we would hope to find only slight changes in the minimizer of $f$. Unfortunately
this is not the case. While slight changes in $\mathbf{c}$ lead to equally slight changes in $\bar{\mathbf{c}}$, this is not
the case for the minimizer of $f$ (and of $\bar{f}$ respectively). Whenever the eigenvalues $\boldsymbol{\Lambda}_i$ are large we
will see only small changes in $\bar{x}_i$ and in the minimum of $\bar{f}$. Conversely, for small $\boldsymbol{\Lambda}_i$ changes in $\bar{x}_i$
can be dramatic. The ratio between the largest and the smallest eigenvalue is called the condition
number of an optimization problem.


$$\kappa = \frac{\boldsymbol{\Lambda}_1}{\boldsymbol{\Lambda}_d}. \tag{11.7.2}$$


If the condition number $\kappa$ is large, it is difficult to solve the optimization problem accurately. We
need to ensure that we are careful in getting a large dynamic range of values right. Our analysis
leads to an obvious, albeit somewhat naive question: couldn't we simply "fix" the problem by
distorting the space such that all eigenvalues are 1? In theory this is quite easy: we only need the
eigenvalues and eigenvectors of $\mathbf{Q}$ to rescale the problem from $\mathbf{x}$ to one in $\mathbf{z} := \boldsymbol{\Lambda}^{\frac{1}{2}} \mathbf{U} \mathbf{x}$. In the
new coordinate system $\mathbf{x}^\top \mathbf{Q} \mathbf{x}$ could be simplified to $\|\mathbf{z}\|^2$. Alas, this is a rather impractical
suggestion. Computing eigenvalues and eigenvectors is in general much more expensive than solving the
actual problem.


While computing eigenvalues exactly might be expensive, guessing them and computing them
even somewhat approximately may already be a lot better than not doing anything at all. In
particular, we could use the diagonal entries of $\mathbf{Q}$ and rescale it accordingly. This is much cheaper
than computing eigenvalues.


$$\tilde{\mathbf{Q}} = \mathrm{diag}^{-\frac{1}{2}}(\mathbf{Q}) \, \mathbf{Q} \, \mathrm{diag}^{-\frac{1}{2}}(\mathbf{Q}). \tag{11.7.3}$$


In this case we have $\tilde{\mathbf{Q}}_{ij} = \mathbf{Q}_{ij} / \sqrt{\mathbf{Q}_{ii} \mathbf{Q}_{jj}}$ and specifically $\tilde{\mathbf{Q}}_{ii} = 1$ for all $i$. In most cases this
simplifies the condition number considerably. For instance, in the cases we discussed previously,
this would entirely eliminate the problem at hand since the problem is axis aligned.
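To make the effect of diagonal rescaling concrete, here is a minimal sketch in plain Python for symmetric 2 x 2 matrices. The helper names and the second ("correlated") matrix are assumptions chosen for illustration. The first matrix is the Hessian of the axis-aligned example 0.1 x1^2 + 2 x2^2. The eigenvalues are computed with the closed-form expression for symmetric 2 x 2 matrices.

```python
import math


def eigenvalues_sym_2x2(q):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]] (closed form)."""
    a, b, d = q[0][0], q[0][1], q[1][1]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius


def condition_number(q):
    largest, smallest = eigenvalues_sym_2x2(q)
    return largest / smallest


def diagonal_precondition(q):
    """Return the matrix with entries Q_ij / sqrt(Q_ii * Q_jj)."""
    return [[q[i][j] / math.sqrt(q[i][i] * q[j][j]) for j in range(2)]
            for i in range(2)]


axis_aligned = [[0.2, 0.0], [0.0, 4.0]]   # Hessian of 0.1 x1^2 + 2 x2^2
correlated = [[0.2, 0.3], [0.3, 4.0]]     # assumed non-axis-aligned example

for name, q in (('axis aligned', axis_aligned), ('correlated', correlated)):
    print(f'{name}: kappa = {condition_number(q):.2f} -> '
          f'{condition_number(diagonal_precondition(q)):.2f} after rescaling')
```

In the axis-aligned case the rescaled matrix is the identity, so the condition number drops to 1. In the correlated case the off-diagonal entries remain, so rescaling helps but does not remove the ill-conditioning entirely.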


Unfortunately we face yet another problem: in deep learning we typically do not even have access
to the second derivative of the objective function: for $\mathbf{x} \in \mathbb{R}^d$ the second derivative even on a
minibatch may require $\mathcal{O}(d^2)$ space and work to compute, thus making it practically infeasible.
The ingenious idea of Adagrad is to use a proxy for that elusive diagonal of the Hessian that is both
relatively cheap to compute and effective: the magnitude of the gradient itself.


In order to see why this works, let us look at $\bar{f}(\bar{\mathbf{x}})$. We have that

$$\partial_{\bar{\mathbf{x}}} \bar{f}(\bar{\mathbf{x}}) = \boldsymbol{\Lambda} \bar{\mathbf{x}} + \bar{\mathbf{c}} = \boldsymbol{\Lambda} \left(\bar{\mathbf{x}} - \bar{\mathbf{x}}_0\right), \tag{11.7.4}$$


where $\bar{\mathbf{x}}_0$ is the minimizer of $\bar{f}$. Hence the magnitude of the gradient depends both on $\boldsymbol{\Lambda}$ and the
distance from optimality. If $\bar{\mathbf{x}} - \bar{\mathbf{x}}_0$ didn't change, this would be all that's needed. After all, in this
case the magnitude of the gradient $\partial_{\bar{\mathbf{x}}} \bar{f}(\bar{\mathbf{x}})$ suffices. Since AdaGrad is a stochastic gradient descent
algorithm, we will see gradients with nonzero variance even at optimality. As a result we can safely
use the variance of the gradients as a cheap proxy for the scale of the Hessian. A thorough analysis
is beyond the scope of this section (it would be several pages). We refer the reader to (Duchi et al.,
2011) for details.


11.7.3 The Algorithm 


Let us formalize the discussion from above. We use the variable $\mathbf{s}_t$ to accumulate past gradient
variance as follows.


$$\begin{aligned}
\mathbf{g}_t &= \partial_{\mathbf{w}} l(y_t, f(\mathbf{x}_t, \mathbf{w})), \\
\mathbf{s}_t &= \mathbf{s}_{t-1} + \mathbf{g}_t^2, \\
\mathbf{w}_t &= \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \cdot \mathbf{g}_t.
\end{aligned} \tag{11.7.5}$$


Here the operations are applied coordinate-wise. That is, $\mathbf{v}^2$ has entries $v_i^2$. Likewise $\frac{1}{\sqrt{\mathbf{v}}}$ has entries
$\frac{1}{\sqrt{v_i}}$ and $\mathbf{u} \cdot \mathbf{v}$ has entries $u_i v_i$. As before $\eta$ is the learning rate and $\epsilon$ is an additive constant that
ensures that we do not divide by 0. Last, we initialize $\mathbf{s}_0 = \mathbf{0}$.


Just like in the case of momentum we need to keep track of an auxiliary variable, in this case to
allow for an individual learning rate per coordinate. This does not increase the cost of Adagrad
significantly relative to SGD, simply since the main cost is typically to compute $l(y_t, f(\mathbf{x}_t, \mathbf{w}))$ and
its derivative.
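The per-coordinate learning rate implied by the update above can be illustrated with a toy gradient stream. The sketch below is plain Python. The function name, the learning rate of 0.1, and the stream are assumptions for illustration: coordinate 0 receives a gradient of 1 at every step, coordinate 1 only at every 10th step, mimicking a frequent and an infrequent feature.

```python
import math


def adagrad_effective_rates(grad_sequence, eta=0.1, eps=1e-6):
    """Accumulate squared gradients per coordinate; return eta / sqrt(s + eps)."""
    s = [0.0] * len(grad_sequence[0])
    for grads in grad_sequence:
        s = [s_i + g_i ** 2 for s_i, g_i in zip(s, grads)]
    return [eta / math.sqrt(s_i + eps) for s_i in s]


# Assumed gradient stream: coordinate 0 is active at every step,
# coordinate 1 only at every 10th step (a "sparse" feature).
steps = 100
stream = [(1.0, 1.0 if t % 10 == 0 else 0.0) for t in range(steps)]
frequent, infrequent = adagrad_effective_rates(stream)
print(f'effective rate, frequent feature:   {frequent:.4f}')
print(f'effective rate, infrequent feature: {infrequent:.4f}')
```

The infrequent coordinate retains a larger effective learning rate because its accumulator has grown less, which is the behavior that the counter-based scheme of Section 11.7.1 was meant to achieve.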


Note that accumulating squared gradients in $\mathbf{s}_t$ means that $\mathbf{s}_t$ grows essentially at linear rate (somewhat
slower than linearly in practice, since the gradients initially diminish). This leads to an
$\mathcal{O}(t^{-\frac{1}{2}})$ learning rate, albeit adjusted on a per coordinate basis. For convex problems this is
perfectly adequate. In deep learning, though, we might want to decrease the learning rate rather
more slowly. This led to a number of Adagrad variants that we will discuss in the subsequent
chapters. For now let us see how it behaves in a quadratic convex problem. We use the same
problem as before:


$$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2. \tag{11.7.6}$$


We are going to implement Adagrad using the same learning rate as previously, i.e., $\eta = 0.4$. As
we can see, the iterative trajectory of the independent variable is smoother. However, due to the
cumulative effect of $\mathbf{s}_t$, the learning rate continuously decays, so the independent variable does
not move as much during later stages of iteration.


%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np, npx
npx.set_np()


def adagrad_2d(x1, x2, s1, s2):
    eps = 1e-6
    g1, g2 = 0.2 * x1, 4 * x2
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2


def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2


eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))


As we increase the learning rate to 2 we see much better behavior. This already indicates that the
decrease in learning rate might be rather aggressive, even in the noise-free case and we need to
ensure that parameters converge appropriately.


eta = 2
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))








11.7.4 Implementation from Scratch 


Just like the momentum method, Adagrad needs to maintain a state variable of the same shape as 
the parameters. 


def init_adagrad_states(feature_dim):
    s_w = np.zeros((feature_dim, 1))
    s_b = np.zeros(1)
    return (s_w, s_b)


def adagrad(params, states, hyperparams):
    eps = 1e-6
    for p, s in zip(params, states):
        s[:] += np.square(p.grad)
        p[:] -= hyperparams['lr'] * p.grad / np.sqrt(s + eps)


Compared to the experiment in Section 11.5 we use a larger learning rate to train the model.


data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adagrad, init_adagrad_states(feature_dim),
               {'lr': 0.1}, data_iter, feature_dim);


loss: 0.243, 0.086 sec/epoch

11.7.5 Concise Implementation


Using the Trainer instance of the algorithm adagrad, we can invoke the Adagrad algorithm in
Gluon.


d2l.train_concise_ch11('adagrad', {'learning_rate': 0.1}, data_iter)


loss: 0.243, 0.089 sec/epoch

Summary


• Adagrad decreases the learning rate dynamically on a per-coordinate basis.


• It uses the magnitude of the gradient as a means of adjusting how quickly progress is
achieved: coordinates with large gradients are compensated with a smaller learning rate.


• Computing the exact second derivative is typically infeasible in deep learning problems due
to memory and computational constraints. The gradient can be a useful proxy.


• If the optimization problem has a rather uneven structure Adagrad can help mitigate the
distortion.


• Adagrad is particularly effective for sparse features where the learning rate needs to decrease
more slowly for infrequently occurring terms.


• On deep learning problems Adagrad can sometimes be too aggressive in reducing learning
rates. We will discuss strategies for mitigating this in the context of Section 11.10.


Exercises 


1. Prove that for an orthogonal matrix $\mathbf{U}$ and a vector $\mathbf{c}$ the following holds:
$\|\mathbf{c} - \boldsymbol{\delta}\|_2 = \|\mathbf{U} \mathbf{c} - \mathbf{U} \boldsymbol{\delta}\|_2$. Why does this mean that the magnitude of perturbations does not change after an
orthogonal change of variables?


2. Try out Adagrad for $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$ and also for the objective function rotated by 45
degrees, i.e., $f(\mathbf{x}) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. Does it behave differently?


3. Prove Gerschgorin's circle theorem (https://en.wikipedia.org/wiki/Gershgorin_circle_theorem), which states that eigenvalues $\lambda_i$ of a matrix $\mathbf{M}$ satisfy
$|\lambda_i - \mathbf{M}_{jj}| \leq \sum_{k \neq j} |\mathbf{M}_{jk}|$ for at least one choice of $j$.


4. What does Gerschgorin's theorem tell us about the eigenvalues of the diagonally preconditioned
matrix $\mathrm{diag}^{-\frac{1}{2}}(\mathbf{M}) \, \mathbf{M} \, \mathrm{diag}^{-\frac{1}{2}}(\mathbf{M})$?


5. Try out Adagrad for a proper deep network, such as Section 6.6 when applied to Fashion
MNIST.


6. How would you need to modify Adagrad to achieve a less aggressive decay in learning rate?


Discussions: https://discuss.d2l.ai/t/355


11.8 RMSProp


One of the key issues in Section 11.7 is that the learning rate decreases at a predefined schedule
of effectively $\mathcal{O}(t^{-\frac{1}{2}})$. While this is generally appropriate for convex problems, it might not be
ideal for nonconvex ones, such as those encountered in deep learning. Yet, the coordinate-wise
adaptivity of Adagrad is highly desirable as a preconditioner.


(Tieleman & Hinton, 2012) proposed the RMSProp algorithm as a simple fix to decouple rate
scheduling from coordinate-adaptive learning rates. The issue is that Adagrad accumulates the
squares of the gradient $\mathbf{g}_t$ into a state vector $\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2$. As a result $\mathbf{s}_t$ keeps on growing without
bound due to the lack of normalization, essentially linearly as the algorithm converges.


One way of fixing this problem would be to use $\mathbf{s}_t / t$. For reasonable distributions of $\mathbf{g}_t$ this will
converge. Unfortunately it might take a very long time until the limit behavior starts to matter
since the procedure remembers the full trajectory of values. An alternative is to use a leaky
average in the same way we used in the momentum method, i.e.,
$\mathbf{s}_t \leftarrow \gamma \mathbf{s}_{t-1} + (1 - \gamma) \mathbf{g}_t^2$ for some
parameter $\gamma > 0$. Keeping all other parts unchanged yields RMSProp.


11.8.1 The Algorithm


Let us write out the equations in detail.


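$$\begin{aligned}
\mathbf{s}_t &\leftarrow \gamma \mathbf{s}_{t-1} + (1 - \gamma) \mathbf{g}_t^2, \\
\mathbf{x}_t &\leftarrow \mathbf{x}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t.
\end{aligned} \tag{11.8.1}$$


The constant $\epsilon > 0$ is typically set to $10^{-6}$ to ensure that we do not suffer from division by zero
or overly large step sizes. Given this expansion we are now free to control the learning rate $\eta$
independently of the scaling that is applied on a per-coordinate basis. In terms of leaky averages
we can apply the same reasoning as previously applied in the case of the momentum method.
Expanding the definition of $\mathbf{s}_t$ yields


$$\begin{aligned}
\mathbf{s}_t &= (1 - \gamma) \mathbf{g}_t^2 + \gamma \mathbf{s}_{t-1} \\
&= (1 - \gamma) \left(\mathbf{g}_t^2 + \gamma \mathbf{g}_{t-1}^2 + \gamma^2 \mathbf{g}_{t-2}^2 + \ldots \right).
\end{aligned} \tag{11.8.2}$$


As before in Section 11.6 we use $1 + \gamma + \gamma^2 + \ldots = \frac{1}{1-\gamma}$. Hence the sum of weights is normalized
to 1 with a half-life time of an observation of $\gamma^{-1}$. Let us visualize the weights for the past 40 time
steps for various choices of $\gamma$.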


%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np, npx

npx.set_np()


d2l.set_figsize()
gammas = [0.95, 0.9, 0.8, 0.7]
for gamma in gammas:
    x = np.arange(40).asnumpy()
    d2l.plt.plot(x, (1 - gamma) * gamma ** x, label=f'gamma = {gamma:.2f}')
d2l.plt.xlabel('time')
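The difference between Adagrad's running sum and the leaky average of (11.8.1) shows up most clearly for a constant gradient. The following sketch is plain Python. The function names and the gradient magnitude of 2 are assumptions for illustration. It compares how the two state variables evolve with the number of steps.

```python
def adagrad_state(grads):
    """Adagrad accumulator: s_t = s_{t-1} + g_t^2."""
    s = 0.0
    for g in grads:
        s += g ** 2
    return s


def rmsprop_state(grads, gamma=0.9):
    """RMSProp accumulator: s_t = gamma * s_{t-1} + (1 - gamma) * g_t^2."""
    s = 0.0
    for g in grads:
        s = gamma * s + (1 - gamma) * g ** 2
    return s


for steps in (10, 100, 1000):
    grads = [2.0] * steps   # assumed constant gradient of magnitude 2
    print(f't = {steps:4d}: Adagrad s_t = {adagrad_state(grads):7.1f}, '
          f'RMSProp s_t = {rmsprop_state(grads):.3f}')
```

Adagrad's state grows linearly with the number of steps, so its effective learning rate keeps shrinking, while the RMSProp state settles near the squared gradient magnitude and leaves the step size to be controlled by the learning rate alone.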
11.8.2 Implementation from Scratch


As before we use the quadratic function $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$ to observe the trajectory of RMSProp.
Recall that in Section 11.7, when we used Adagrad with a learning rate of 0.4, the variables moved
only very slowly in the later stages of the algorithm since the learning rate decreased too quickly.
Since $\eta$ is controlled separately this does not happen with RMSProp.


def rmsprop_2d(x1, x2, s1, s2):
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2


def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2


eta, gamma = 0.4, 0.9
d2l.show_trace_2d(f_2d, d2l.train_2d(rmsprop_2d))








Next, we implement RMSProp to be used in a deep network. This is equally straightforward. 


def init_rmsprop_states(feature_dim):
    s_w = np.zeros((feature_dim, 1))
    s_b = np.zeros(1)
    return (s_w, s_b)


def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s in zip(params, states):
        s[:] = gamma * s + (1 - gamma) * np.square(p.grad)
        p[:] -= hyperparams['lr'] * p.grad / np.sqrt(s + eps)


We set the initial learning rate to 0.01 and the weighting term $\gamma$ to 0.9. That is, $\mathbf{s}$ aggregates on
average over the past $1/(1-\gamma) = 10$ observations of the square gradient.


data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(rmsprop, init_rmsprop_states(feature_dim),
               {'lr': 0.01, 'gamma': 0.9}, data_iter, feature_dim);


loss: 0.245, 0.086 sec/epoch

11.8.3 Concise Implementation


Since RMSProp is a rather popular algorithm it is also available in the Trainer instance. All we
need to do is instantiate it using an algorithm named rmsprop, assigning $\gamma$ to the parameter gamma1.


d2l.train_concise_ch11('rmsprop', {'learning_rate': 0.01, 'gamma1': 0.9},
                       data_iter)


loss: 0.245, 0.050 sec/epoch

Summary


• RMSProp is very similar to Adagrad insofar as both use the square of the gradient to scale
coefficients.


• RMSProp shares with momentum the leaky averaging. However, RMSProp uses the technique
to adjust the coefficient-wise preconditioner.


• The learning rate needs to be scheduled by the experimenter in practice.


• The coefficient $\gamma$ determines how long the history is when adjusting the per-coordinate
scale.


Exercises 


1. What happens experimentally if we set $\gamma = 1$? Why?


2. Rotate the optimization problem to minimize $f(\mathbf{x}) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. What
happens to the convergence?


3. Try out what happens to RMSProp on a real machine learning problem, such as training on
Fashion-MNIST. Experiment with different choices for adjusting the learning rate.


4. Would you want to adjust $\gamma$ as optimization progresses? How sensitive is RMSProp to this?


Discussions: https://discuss.d2l.ai/t/356


11.9 Adadelta


Adadelta is yet another variant of AdaGrad (Section 11.7). The main difference lies in the fact that
it decreases the amount by which the learning rate is adaptive to coordinates. Moreover,
traditionally it is referred to as not having a learning rate since it uses the amount of change itself as
calibration for future change. The algorithm was proposed in (Zeiler, 2012). It is fairly
straightforward, given the discussion of previous algorithms so far.


11.9.1 The Algorithm


In a nutshell, Adadelta uses two state variables, $\mathbf{s}_t$ to store a leaky average of the second moment
of the gradient and $\Delta\mathbf{x}_t$ to store a leaky average of the second moment of the change of parameters
in the model itself. Note that we use the original notation and naming of the authors for
compatibility with other publications and implementations (there is no other real reason why one should
use different Greek variables to indicate a parameter serving the same purpose in momentum,
Adagrad, RMSProp, and Adadelta).


Here are the technical details of Adadelta. Given the parameter du jour is $\rho$, we obtain the
following leaky updates similarly to Section 11.8:


$$\mathbf{s}_t = \rho \mathbf{s}_{t-1} + (1 - \rho) \mathbf{g}_t^2. \tag{11.9.1}$$

The difference to Section 11.8 is that we perform updates with the rescaled gradient $\mathbf{g}_t'$, i.e.,


$$\mathbf{x}_t = \mathbf{x}_{t-1} - \mathbf{g}_t'. \tag{11.9.2}$$


So what is the rescaled gradient $\mathbf{g}_t'$? We can calculate it as follows:


$$\mathbf{g}_t' = \frac{\sqrt{\Delta\mathbf{x}_{t-1} + \epsilon}}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t, \tag{11.9.3}$$


where $\Delta\mathbf{x}_{t-1}$ is the leaky average of the squared rescaled gradients $\mathbf{g}_t'$. We initialize $\Delta\mathbf{x}_0$ to be 0
and update it at each step with $\mathbf{g}_t'$, i.e.,


$$\Delta\mathbf{x}_t = \rho \Delta\mathbf{x}_{t-1} + (1 - \rho) {\mathbf{g}_t'}^2, \tag{11.9.4}$$


and $\epsilon$ (a small value such as $10^{-5}$) is added to maintain numerical stability.
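Before turning to the full implementation, the four update equations can be traced on the 2D quadratic used in earlier sections. The sketch below is plain Python and is only an illustration under stated assumptions: the function name adadelta_2d, the starting point (-5, -2), and the number of steps are choices made here and do not come from the text.

```python
import math


def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2


def adadelta_2d(x1, x2, state, rho=0.9, eps=1e-5):
    """One Adadelta step; state holds (s1, s2, delta1, delta2)."""
    s1, s2, d1, d2 = state
    g1, g2 = 0.2 * x1, 4 * x2
    # Leaky average of squared gradients, as in (11.9.1)
    s1 = rho * s1 + (1 - rho) * g1 ** 2
    s2 = rho * s2 + (1 - rho) * g2 ** 2
    # Rescaled gradient, as in (11.9.3)
    r1 = math.sqrt(d1 + eps) / math.sqrt(s1 + eps) * g1
    r2 = math.sqrt(d2 + eps) / math.sqrt(s2 + eps) * g2
    # Leaky average of squared rescaled gradients, as in (11.9.4)
    d1 = rho * d1 + (1 - rho) * r1 ** 2
    d2 = rho * d2 + (1 - rho) * r2 ** 2
    # Parameter update, as in (11.9.2)
    return x1 - r1, x2 - r2, (s1, s2, d1, d2)


x1, x2, state = -5.0, -2.0, (0.0, 0.0, 0.0, 0.0)
start = f_2d(x1, x2)
for _ in range(500):
    x1, x2, state = adadelta_2d(x1, x2, state)
print(f'f went from {start:.3f} to {f_2d(x1, x2):.3f}')
```

Note that no learning rate appears anywhere: the first steps are tiny (their size is set by eps) and grow as the average of past changes builds up.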


11.9.2 Implementation 


Adadelta needs to maintain two state variables for each variable, s; and Ax;. This yields the fol- 
lowing implementation. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()


def init_adadelta_states(feature_dim):
    s_w, s_b = np.zeros((feature_dim, 1)), np.zeros(1)
    delta_w, delta_b = np.zeros((feature_dim, 1)), np.zeros(1)
    return ((s_w, delta_w), (s_b, delta_b))


def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        # In-place updates via [:]
        s[:] = rho * s + (1 - rho) * np.square(p.grad)
        g = (np.sqrt(delta + eps) / np.sqrt(s + eps)) * p.grad
        p[:] -= g
        delta[:] = rho * delta + (1 - rho) * g * g


Choosing $\rho = 0.9$ amounts to a half-life time of 10 for each parameter update. This tends to work
quite well. We get the following behavior.


data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adadelta, init_adadelta_states(feature_dim),
               {'rho': 0.9}, data_iter, feature_dim);


loss: 0.243, 0.273 sec/epoch

For a concise implementation we simply use the adadelta algorithm from the Trainer class. This
yields the following one-liner for a much more compact invocation.


d2l.train_concise_ch11('adadelta', {'rho': 0.9}, data_iter)


loss: 0.244, 0.267 sec/epoch

Summary


• Adadelta has no learning rate parameter. Instead, it uses the rate of change in the parameters
itself to adapt the learning rate.


• Adadelta requires two state variables to store the second moments of gradient and the
change in parameters.


• Adadelta uses leaky averages to keep a running estimate of the appropriate statistics.


Exercises


1. Adjust the value of $\rho$. What happens?


2. Show how to implement the algorithm without the use of $\mathbf{g}_t'$. Why might this be a good idea?


3. Is Adadelta really learning rate free? Could you find optimization problems that break
Adadelta?


4. Compare Adadelta to Adagrad and RMSProp to discuss their convergence behavior.


Discussions!’ 


11.10 Adam 


In the discussions leading up to this section we encountered a number of techniques for efficient 
optimization. Let us recap them in detail here: 


• We saw that Section 11.4 is more effective than Gradient Descent when solving optimization
problems, e.g., due to its inherent resilience to redundant data.


• We saw that Section 11.5 affords significant additional efficiency arising from vectorization,
using larger sets of observations in one minibatch. This is the key to efficient multi-machine,
multi-GPU and overall parallel processing.


• Section 11.6 added a mechanism for aggregating a history of past gradients to accelerate
convergence.


• Section 11.7 used per-coordinate scaling to allow for a computationally efficient preconditioner.


• Section 11.8 decoupled per-coordinate scaling from a learning rate adjustment.


Adam (Kingma & Ba, 2014) combines all these techniques into one efficient learning algorithm.
As expected, it has become rather popular as one of the more robust and
effective optimization algorithms to use in deep learning. It is not without issues, though. In
particular, Reddi et al. (2019) show that there are situations where Adam can diverge due to poor
variance control. In follow-up work, Zaheer et al. (2018) proposed a hotfix to Adam, called Yogi,
which addresses these issues. More on this later. For now let us review the Adam algorithm.


11.10.1 The Algorithm 


One of the key components of Adam is that it uses exponential weighted moving averages (also 
known as leaky averaging) to obtain an estimate of both the momentum and also the second mo- 
ment of the gradient. That is, it uses the state variables 


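$$\begin{aligned}
\mathbf{v}_t & \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t, \\
\mathbf{s}_t & \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2.
\end{aligned} \qquad (11.10.1)$$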


Here $\beta_1$ and $\beta_2$ are nonnegative weighting parameters. Common choices for them are $\beta_1 = 0.9$
and $\beta_2 = 0.999$. That is, the variance estimate moves much more slowly than the momentum term.
Note that if we initialize $\mathbf{v}_0 = \mathbf{s}_0 = 0$ we have a significant amount of bias initially towards smaller
values. This can be addressed by using the fact that $\sum_{i=0}^{t-1} \beta^i = \frac{1 - \beta^t}{1 - \beta}$ to re-normalize terms.
Correspondingly the normalized state variables are given by

$$\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t} \text{ and } \hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t}. \qquad (11.10.2)$$
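
A short derivation, added here as a sketch, shows where the factor $1 - \beta_1^t$ comes from. It
assumes, purely for illustration, that all gradients share the same expectation $\mathbb{E}[\mathbf{g}]$; this
stationarity assumption is ours and is not stated in the text above. The same argument applies to
$\mathbf{s}_t$ with $\beta_2$ and $\mathbf{g}_t^2$.

```latex
\begin{aligned}
\mathbf{v}_t &= (1 - \beta_1) \sum_{i=0}^{t-1} \beta_1^i \, \mathbf{g}_{t-i}
  && \text{(unrolling the update with } \mathbf{v}_0 = 0\text{)} \\
\mathbb{E}[\mathbf{v}_t] &= (1 - \beta_1) \sum_{i=0}^{t-1} \beta_1^i \, \mathbb{E}[\mathbf{g}]
  = (1 - \beta_1^t) \, \mathbb{E}[\mathbf{g}]
  && \text{(geometric sum)} \\
\mathbb{E}\left[\frac{\mathbf{v}_t}{1 - \beta_1^t}\right] &= \mathbb{E}[\mathbf{g}]
\end{aligned}
```

For $\beta_1 = 0.9$ and $t = 1$ the raw estimate is only one tenth of the gradient, which is why the
correction matters most during the first few steps.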
Armed with the proper estimates we can now write out the update equations. First, we rescale the
gradient in a manner very much akin to that of RMSProp to obtain

$$\mathbf{g}_t' = \frac{\eta \hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon}. \qquad (11.10.3)$$

Unlike RMSProp our update uses the momentum $\hat{\mathbf{v}}_t$ rather than the gradient itself. Moreover,
there is a slight cosmetic difference as the rescaling happens using $\frac{1}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon}$ instead of $\frac{1}{\sqrt{\hat{\mathbf{s}}_t + \epsilon}}$. The
former works arguably slightly better in practice, hence the deviation from RMSProp. Typically
we pick $\epsilon = 10^{-6}$ for a good trade-off between numerical stability and fidelity.








Now we have all the pieces in place to compute updates. This is slightly anticlimactic and we have 
a simple update of the form 


$$\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \mathbf{g}_t'. \qquad (11.10.4)$$


Reviewing the design of Adam, its inspiration is clear. First, momentum and scale are clearly visible in
the state variables. Their rather peculiar definition forces us to debias terms (this could be fixed
by a slightly different initialization and update condition). Second, the combination of both terms
is pretty straightforward, given RMSProp. Last, the explicit learning rate $\eta$ allows us to control the
step length to address issues of convergence.
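
To make equations (11.10.1) to (11.10.4) concrete, here is a minimal sketch of a single Adam step
for a scalar parameter in plain Python. The function name `adam_step`, its argument order, and the
toy objective $f(x) = x^2$ are assumptions made for illustration and are not part of the d2l
library; the default hyperparameters follow the values quoted above.

```python
import math


def adam_step(x, grad, v, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-6):
    """One Adam update for a scalar parameter x at time step t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * grad        # leaky average of the gradient
    s = beta2 * s + (1 - beta2) * grad ** 2   # leaky average of the squared gradient
    v_hat = v / (1 - beta1 ** t)              # bias-corrected momentum
    s_hat = s / (1 - beta2 ** t)              # bias-corrected second moment
    x = x - lr * v_hat / (math.sqrt(s_hat) + eps)
    return x, v, s


# Toy problem (an assumption for illustration): minimize f(x) = x^2, gradient 2x.
x, v, s = 1.0, 0.0, 0.0
for t in range(1, 4):
    x, v, s = adam_step(x, 2 * x, v, s, t)
    print(t, x)
```

In the first step the raw averages are much smaller than the gradient and its square, yet the
bias-corrected ratio is close to 1, so the parameter moves by roughly the learning rate.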


11.10.2 Implementation 


Implementing Adam from scratch is not very daunting. For convenience we store the time step 
counter t in the hyperparams dictionary. Beyond that all is straightforward. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import np, npx

npx.set_np()

def init_adam_states(feature_dim):
    # One (v, s) pair per parameter: weights and bias
    v_w, v_b = np.zeros((feature_dim, 1)), np.zeros(1)
    s_w, s_b = np.zeros((feature_dim, 1)), np.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        # Leaky averages of the gradient and of its square
        v[:] = beta1 * v + (1 - beta1) * p.grad
        s[:] = beta2 * s + (1 - beta2) * np.square(p.grad)
        # Bias correction for the zero initialization
        v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
        s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
        p[:] -= hyperparams['lr'] * v_bias_corr / (np.sqrt(s_bias_corr) + eps)
    hyperparams['t'] += 1


We are ready to use Adam to train the model. We use a learning rate of $\eta = 0.01$.

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adam, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);


loss: 0.243, 0.114 sec/epoch 
A more concise implementation is straightforward since adam is one of the algorithms provided
as part of the Gluon trainer optimization library. Hence we only need to pass configuration
parameters for an implementation in Gluon.


d2l.train_concise_ch11('adam', {'learning_rate': 0.01}, data_iter)


loss: 0.244, 0.051 sec/epoch 
11.10.3 Yogi


One of the problems of Adam is that it can fail to converge even in convex settings when the second
moment estimate in $\mathbf{s}_t$ blows up. As a fix, Zaheer et al. (2018) proposed a refined update (and
initialization) for $\mathbf{s}_t$. To understand what is going on, let us rewrite the Adam update as follows:


$$\mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + (1 - \beta_2) \left(\mathbf{g}_t^2 - \mathbf{s}_{t-1}\right). \qquad (11.10.5)$$


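Whenever $\mathbf{g}_t^2$ has high variance or updates are sparse, $\mathbf{s}_t$ might forget past values too quickly. A
possible fix for this is to replace $\mathbf{g}_t^2 - \mathbf{s}_{t-1}$ by $\mathbf{g}_t^2 \odot \mathop{\mathrm{sgn}}(\mathbf{g}_t^2 - \mathbf{s}_{t-1})$. Now the magnitude of the update
no longer depends on the amount of deviation. This yields the Yogi updates


$$\mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 \odot \mathop{\mathrm{sgn}}(\mathbf{g}_t^2 - \mathbf{s}_{t-1}). \qquad (11.10.6)$$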


The authors furthermore advise initializing the momentum on a larger initial batch rather than
from just an initial pointwise estimate. We omit the details since they are not material to the
discussion and since even without this, convergence remains pretty good.
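
The difference between the two second-moment updates can be seen on a scalar example. The sketch
below applies the Adam form and the Yogi form of the update to a made-up gradient stream with one
large gradient followed by many tiny ones. The function names and the gradient stream are
assumptions chosen for illustration only.

```python
def adam_second_moment(s, g, beta2=0.999):
    # Adam form: move s toward g^2 by a fraction (1 - beta2) of the gap.
    return s + (1 - beta2) * (g ** 2 - s)


def yogi_second_moment(s, g, beta2=0.999):
    # Yogi form: the step size depends only on g^2; the gap only sets the sign.
    gap = g ** 2 - s
    sign = (gap > 0) - (gap < 0)
    return s + (1 - beta2) * sign * g ** 2


# Hypothetical sparse gradient stream: one large gradient, then 1000 tiny ones.
grads = [10.0] + [0.01] * 1000
s_adam = s_yogi = 0.0
for g in grads:
    s_adam = adam_second_moment(s_adam, g)
    s_yogi = yogi_second_moment(s_yogi, g)
print(s_adam, s_yogi)
```

After the spike both estimates agree. Under the Adam form the estimate then decays geometrically,
while under the Yogi form it shrinks only by a small amount proportional to the tiny squared
gradients, so the memory of the large gradient is retained for much longer.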


def yogi(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-3
    for p, (v, s) in zip(params, states):
        v[:] = beta1 * v + (1 - beta1) * p.grad
        s[:] = s + (1 - beta2) * np.sign(
            np.square(p.grad) - s) * np.square(p.grad)
        v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
        s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
        p[:] -= hyperparams['lr'] * v_bias_corr / (np.sqrt(s_bias_corr) + eps)
    hyperparams['t'] += 1


data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(yogi, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);


loss: 0.243, 0.124 sec/epoch 


[Figure: training loss versus epoch for Yogi.]


Summary 


• Adam combines features of many optimization algorithms into a fairly robust update rule.

• Created on the basis of RMSProp, Adam also uses an exponentially weighted moving average
  (EWMA) on the minibatch stochastic gradient.

• Adam uses bias correction to adjust for a slow startup when estimating momentum and a
  second moment.

• For gradients with significant variance we may encounter issues with convergence. They
  can be amended by using larger minibatches or by switching to an improved estimate for $\mathbf{s}_t$.
  Yogi offers such an alternative.


Exercises 


1. Adjust the learning rate and observe and analyze the experimental results.
2. Can you rewrite the momentum and second moment updates such that they do not require bias
   correction?
3. Why do you need to reduce the learning rate $\eta$ as we converge?
4. Try to construct a case for which Adam diverges and Yogi converges.


Discussions


11.11 Learning Rate Scheduling 


So far we primarily focused on optimization algorithms for how to update the weight vectors rather 
than on the rate at which they are being updated. Nonetheless, adjusting the learning rate is often 
just as important as the actual algorithm. There are a number of aspects to consider: 


• Most obviously the magnitude of the learning rate matters. If it is too large, optimization
  diverges; if it is too small, it takes too long to train or we end up with a suboptimal result.
  We saw previously that the condition number of the problem matters (see e.g., Section 11.6
  for details). Intuitively it is the ratio of the amount of change in the least sensitive direction
  vs. the most sensitive one.

• Secondly, the rate of decay is just as important. If the learning rate remains large we may
  simply end up bouncing around the minimum and thus not reach optimality. Section 11.5
  discussed this in some detail and we analyzed performance guarantees in Section 11.4. In
  short, we want the rate to decay, but probably more slowly than $\mathcal{O}(t^{-\frac{1}{2}})$, which would be a
  good choice for convex problems.

• Another aspect that is equally important is initialization. This pertains both to how the
  parameters are set initially (review Section 4.8 for details) and also how they evolve initially.
  This goes under the moniker of warmup, i.e., how rapidly we start moving towards the
  solution initially. Large steps in the beginning might not be beneficial, in particular since the
  initial set of parameters is random. The initial update directions might be quite meaningless,
  too.





• Lastly, there are a number of optimization variants that perform cyclical learning rate
  adjustment. This is beyond the scope of the current chapter. We recommend that the reader
  review the details in Izmailov et al. (2018), e.g., how to obtain better solutions by averaging
  over an entire path of parameters.

Discussions: https://discuss.d2l.ai/t/358


Given the fact that there is a lot of detail needed to manage learning rates, most deep learning 
frameworks have tools to deal with this automatically. In the current chapter we will review the 
effects that different schedules have on accuracy and also show how this can be managed effi- 
ciently via a learning rate scheduler. 


11.11.1 Toy Problem 


We begin with a toy problem that is cheap enough to compute easily, yet sufficiently nontrivial to 
illustrate some of the key aspects. For that we pick a slightly modernized version of LeNet (relu 
instead of sigmoid activation, MaxPooling rather than AveragePooling), as applied to Fashion- 
MNIST. Moreover, we hybridize the network for performance. Since most of the code is standard 
we just introduce the basics without further detailed discussion. See Chapter 6 for a refresher as 
needed. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, lr_scheduler, np, npx
from mxnet.gluon import nn

npx.set_np()

net = nn.HybridSequential()
net.add(nn.Conv2D(channels=6, kernel_size=5, padding=2, activation='relu'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Conv2D(channels=16, kernel_size=5, activation='relu'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Dense(120, activation='relu'),
        nn.Dense(84, activation='relu'),
        nn.Dense(10))
net.hybridize()
loss = gluon.loss.SoftmaxCrossEntropyLoss()
device = d2l.try_gpu()

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

# The code is almost identical to `d2l.train_ch6` defined in the
# lenet section of chapter convolutional neural networks
def train(net, train_iter, test_iter, num_epochs, loss, trainer, device):
    net.initialize(force_reinit=True, ctx=device, init=init.Xavier())
    animator = d2l.Animator(xlabel='epoch', xlim=[0, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    for epoch in range(num_epochs):
        metric = d2l.Accumulator(3)  # train_loss, train_acc, num_examples
        for i, (X, y) in enumerate(train_iter):
            X, y = X.as_in_ctx(device), y.as_in_ctx(device)
            with autograd.record():
                y_hat = net(X)
                l = loss(y_hat, y)
            l.backward()
            trainer.step(X.shape[0])
            metric.add(l.sum(), d2l.accuracy(y_hat, y), X.shape[0])
            train_loss = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % 50 == 0:
                animator.add(epoch + i / len(train_iter),
                             (train_loss, train_acc, None))
        test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'train loss {train_loss:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')


Let us have a look at what happens if we invoke this algorithm with default settings, such as a
learning rate of 0.3, and train for 30 epochs. Note how the training accuracy keeps on increasing
while progress in terms of test accuracy stalls beyond a point. The gap between both curves
indicates overfitting.


lr, num_epochs = 0.3, 30
net.initialize(force_reinit=True, ctx=device, init=init.Xavier())
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)


train loss 0.176, train acc 0.933, test acc 0.865 


[Figure: train loss, train acc, and test acc versus epoch.]


11.11.2 Schedulers 


One way of adjusting the learning rate is to set it explicitly at each step. This is conveniently 
achieved by the set_learning_rate method. We could adjust it downward after every epoch (or 
even after every minibatch), e.g., in a dynamic manner in response to how optimization is pro- 
gressing. 


trainer.set_learning_rate(0.1)
print(f'learning rate is now {trainer.learning_rate:.2f}')


learning rate is now 0.10


More generally we want to define a scheduler. When invoked with the number of updates it returns
the appropriate value of the learning rate. Let us define a simple one that sets the learning rate to
$\eta = \eta_0 (t + 1)^{-\frac{1}{2}}$.


class SquareRootScheduler:
    def __init__(self, lr=0.1):
        self.lr = lr

    def __call__(self, num_update):
        return self.lr * pow(num_update + 1.0, -0.5)


Let us plot its behavior over a range of values.


scheduler = SquareRootScheduler(lr=0.1)
d2l.plot(np.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
Now let us see how this plays out for training on Fashion-MNIST. We simply provide the scheduler
as an additional argument to the training algorithm.


trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'lr_scheduler': scheduler})
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)


train loss 0.523, train acc 0.811, test acc 0.802 
[Figure: train loss, train acc, and test acc versus epoch.]


This worked quite a bit better than previously. Two things stand out: first, the curve was
smoother than before; second, there was less overfitting. Unfortunately, why certain strategies
lead to less overfitting is not a well-resolved question in theory. There is some
argument that a smaller stepsize will lead to parameters that are closer to zero and thus simpler.
However, this does not explain the phenomenon entirely since we do not really stop early but
simply reduce the learning rate gently.


11.11.3 Policies 


While we cannot possibly cover the entire variety of learning rate schedulers, we attempt to give 
a brief overview of popular policies below. Common choices are polynomial decay and piecewise 
constant schedules. Beyond that, cosine learning rate schedules have been found to work well 
empirically on some problems. Lastly, on some problems it is beneficial to warm up the optimizer 
prior to using large learning rates. 


Factor Scheduler 


One alternative to a polynomial decay would be a multiplicative one, that is $\eta_{t+1} \leftarrow \eta_t \cdot \alpha$ for
$\alpha \in (0, 1)$. To prevent the learning rate from decaying beyond a reasonable lower bound the
update equation is often modified to $\eta_{t+1} \leftarrow \mathop{\mathrm{max}}(\eta_{\mathrm{min}}, \eta_t \cdot \alpha)$.


class FactorScheduler:
    def __init__(self, factor=1, stop_factor_lr=1e-7, base_lr=0.1):
        self.factor = factor
        self.stop_factor_lr = stop_factor_lr
        self.base_lr = base_lr

    def __call__(self, num_update):
        # Stateful: every call decays the rate once, regardless of num_update
        self.base_lr = max(self.stop_factor_lr, self.base_lr * self.factor)
        return self.base_lr


scheduler = FactorScheduler(factor=0.9, stop_factor_lr=1e-2, base_lr=2.0)
d2l.plot(np.arange(50), [scheduler(t) for t in range(50)])
This can also be accomplished by a built-in scheduler in MXNet via the lr_scheduler.FactorScheduler
object. It takes a few more parameters, such as warmup period, warmup mode
(linear or constant), the maximum number of desired updates, etc. Going forward we will use
the built-in schedulers as appropriate and only explain their functionality here. As illustrated, it
is fairly straightforward to build your own scheduler if needed.


Multi Factor Scheduler 


A common strategy for training deep networks is to keep the learning rate piecewise constant and
to decrease it by a given amount every so often. That is, given a set of times when to decrease the
rate, such as $s = \{5, 10, 20\}$, decrease $\eta_{t+1} \leftarrow \eta_t \cdot \alpha$ whenever $t \in s$. Assuming that the values are
halved at each step we can implement this as follows.


scheduler = lr_scheduler.MultiFactorScheduler(step=[15, 30], factor=0.5,
                                              base_lr=0.5)
d2l.plot(np.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
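
The piecewise constant rule can also be sketched without MXNet. The class below is a hypothetical
stand-in written for illustration, not the built-in scheduler. In particular, the choice to apply
a decay only once the update count has moved strictly past a milestone is an assumption of this
sketch.

```python
class PiecewiseConstantScheduler:
    """Hypothetical sketch: multiply the rate by `factor` after each milestone."""

    def __init__(self, steps, factor=0.5, base_lr=0.5):
        self.steps = sorted(steps)
        self.factor = factor
        self.base_lr = base_lr

    def __call__(self, num_update):
        # Count the milestones that lie strictly behind the current update.
        num_decays = sum(1 for s in self.steps if num_update > s)
        return self.base_lr * self.factor ** num_decays


scheduler_sketch = PiecewiseConstantScheduler(steps=[15, 30])
print([scheduler_sketch(t) for t in (0, 15, 16, 30, 31)])
```

Unlike the FactorScheduler defined earlier, this sketch is stateless: the rate is a pure function
of the update count, so it can be evaluated in any order.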
The intuition behind this piecewise constant learning rate schedule is that one lets optimization
proceed until a stationary point has been reached in terms of the distribution of weight vectors.
Then (and only then) do we decrease the rate so as to obtain a higher quality proxy to a good
local minimum. The example below shows how this can produce slightly better solutions.


trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'lr_scheduler': scheduler})
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)


train loss 0.195, train acc 0.927, test acc 0.884 


[Figure: train loss, train acc, and test acc versus epoch.]


Cosine Scheduler 


A rather perplexing heuristic was proposed by Loshchilov & Hutter (2016). It relies on the
observation that we might not want to decrease the learning rate too drastically in the beginning and
moreover, that we might want to “refine” the solution in the end using a very small learning rate.
This results in a cosine-like schedule with the following functional form for learning rates in the
range $t \in [0, T]$.


$$\eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2} \left(1 + \cos(\pi t / T)\right) \qquad (11.11.1)$$

Here $\eta_0$ is the initial learning rate, $\eta_T$ is the target rate at time $T$. Furthermore, for $t > T$ we simply
pin the value to $\eta_T$ without increasing it again. In the following example, we set the max update
step $T = 20$.


scheduler = lr_scheduler.CosineScheduler(max_update=20, base_lr=0.3,
                                         final_lr=0.01)
d2l.plot(np.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
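
Equation (11.11.1) can be evaluated directly in a few lines of plain Python. The function below is
a sketch for illustration only. Its name `cosine_lr` is an assumption, and its argument names
simply mirror the keyword arguments used in the call above.

```python
import math


def cosine_lr(t, max_update=20, base_lr=0.3, final_lr=0.01):
    """Cosine schedule of (11.11.1); returns final_lr for all t >= max_update."""
    if t >= max_update:
        # At t == max_update the formula gives final_lr as well, since cos(pi) = -1.
        return final_lr
    return final_lr + (base_lr - final_lr) * (1 + math.cos(math.pi * t / max_update)) / 2


print([round(cosine_lr(t), 4) for t in (0, 5, 10, 15, 20, 25)])
```

The schedule starts at the initial rate, passes through the midpoint of the two rates halfway to
`max_update`, and stays at the final rate afterwards.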
In the context of computer vision this schedule can lead to improved results. Note, though, that
such improvements are not guaranteed (as can be seen below).


trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'lr_scheduler': scheduler})
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)


train loss 0.343, train acc 0.877, test acc 0.868 


[Figure: train loss, train acc, and test acc versus epoch.]


Warmup 


In some cases initializing the parameters is not sufficient to guarantee a good solution. This is
particularly a problem for some advanced network designs that may lead to unstable optimization
problems. We could address this by choosing a sufficiently small learning rate to prevent
divergence in the beginning. Unfortunately this means that progress is slow. Conversely, a large
learning rate initially leads to divergence.


A rather simple fix for this dilemma is to use a warmup period during which the learning rate 
increases to its initial maximum and to cool down the rate until the end of the optimization process. 
For simplicity one typically uses a linear increase for this purpose. This leads to a schedule of the 
form indicated below. 
scheduler = lr_scheduler.CosineScheduler(20, warmup_steps=5, base_lr=0.3,
                                         final_lr=0.01)
d2l.plot(np.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
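
A linear warmup followed by a cosine decay can be sketched in plain Python as follows. This is an
illustration under stated assumptions rather than a description of the built-in scheduler: the
function name `warmup_cosine_lr`, the starting rate `warmup_begin_lr=0.0`, and the handling of the
boundary steps are all choices made for this sketch.

```python
import math


def warmup_cosine_lr(t, max_update=20, warmup_steps=5, base_lr=0.3,
                     final_lr=0.01, warmup_begin_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to final_lr (a sketch)."""
    if t < warmup_steps:
        # Linear ramp from warmup_begin_lr toward base_lr.
        return warmup_begin_lr + (base_lr - warmup_begin_lr) * t / warmup_steps
    if t >= max_update:
        return final_lr
    # Cosine decay over the remaining max_update - warmup_steps updates.
    progress = (t - warmup_steps) / (max_update - warmup_steps)
    return final_lr + (base_lr - final_lr) * (1 + math.cos(math.pi * progress)) / 2


print([round(warmup_cosine_lr(t), 4) for t in range(0, 25, 5)])
```

The rate rises during the first `warmup_steps` updates, peaks at `base_lr`, and then follows the
cosine curve down to `final_lr`.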
Note that the network converges better initially (in particular observe the performance during the
first 5 epochs).


trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'lr_scheduler': scheduler})
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)


train loss 0.352, train acc 0.873, test acc 0.870 


[Figure: train loss, train acc, and test acc versus epoch (0 to 30).]


Warmup can be applied to any scheduler (not just cosine). For a more detailed discussion of
learning rate schedules and many more experiments see also Gotmare et al. (2018). In particular they
find that a warmup phase limits the amount of divergence of parameters in very deep networks.
This makes intuitive sense since we would expect significant divergence due to random
initialization in those parts of the network that take the most time to make progress in the beginning.
Summary


• Decreasing the learning rate during training can lead to improved accuracy and (most
  perplexingly) reduced overfitting of the model.

• A piecewise decrease of the learning rate whenever progress has plateaued is effective in
  practice. Essentially this ensures that we converge efficiently to a suitable solution and only
  then reduce the inherent variance of the parameters by reducing the learning rate.

• Cosine schedulers are popular for some computer vision problems. See e.g., GluonCV for
  details of such a scheduler.

• A warmup period before optimization can prevent divergence.

• Optimization serves multiple purposes in deep learning. Besides minimizing the training
  objective, different choices of optimization algorithms and learning rate scheduling can lead
  to rather different amounts of generalization and overfitting on the test set (for the same
  amount of training error).


Exercises 


1. Experiment with the optimization behavior for a given fixed learning rate. What is the best
   model you can obtain this way?
2. How does convergence change if you change the exponent of the decrease in the learning
   rate? Use PolyScheduler for your convenience in the experiments.
3. Apply the cosine scheduler to large computer vision problems, e.g., training ImageNet. How
   does it affect performance relative to other schedulers?
4. How long should warmup last?
5. Can you connect optimization and sampling? Start by using results from (Welling & Teh,
   2011) on Stochastic Gradient Langevin Dynamics.


Discussions


GluonCV: http://gluon-cv.mxnet.io
12 Computational Performance


In deep learning, datasets are usually large and model computation is complex. Therefore, we
are always very concerned about computing performance. This chapter will focus on the important
factors that affect computing performance: imperative programming, symbolic programming,
asynchronous programming, automatic parallel computation, and multi-GPU computation.
By studying this chapter, you should be able to further improve the computing performance of
the models that have been implemented in the previous chapters, for example, by reducing the
model training time without affecting the accuracy of the model.


12.1 Compilers and Interpreters 


So far, this book has focused on imperative programming, which makes use of statements such as 
print, + or if to change a program’s state. Consider the following example of a simple imperative 
program. 


def add(a, b):
    return a + b

def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g

print(fancy_func(1, 2, 3, 4))


10 


Python is an interpreted language. When evaluating fancy_func it performs the operations mak- 
ing up the function’s body in sequence. That is, it will evaluate e = add(a, b) and it will store the 
results as variable e, thereby changing the program’s state. The next two statements f = add(c, 
d) and g = add(e, f) will be executed similarly, performing additions and storing the results as 
variables. Fig. 12.1.1 illustrates the flow of data. 
Fig. 12.1.1: Data flow in an imperative program.


Although imperative programming is convenient, it may be inefficient. On the one hand, even if the
add function is repeatedly called throughout fancy_func, Python will execute the three function
calls individually. If these are executed, say, on a GPU (or even on multiple GPUs), the overhead
arising from the Python interpreter can become overwhelming. On the other hand, it will need to save
the variable values of e and f until all the statements in fancy_func have been executed. This is
because we do not know whether the variables e and f will be used by other parts of the program
after the statements e = add(a, b) and f = add(c, d) have been executed.


12.1.1 Symbolic Programming 


Consider the alternative, symbolic programming where computation is usually performed only 
once the process has been fully defined. This strategy is used by multiple deep learning frame- 
works, including Theano, Keras and TensorFlow (the latter two have since acquired imperative 
extensions). It usually involves the following steps: 


1. Define the operations to be executed. 
2. Compile the operations into an executable program. 
3. Provide the required inputs and call the compiled program for execution. 
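
The three steps above can be mimicked in plain Python with the built-in compile and eval
functions. This is only a sketch of the workflow, not a description of how a deep learning
framework builds or optimizes its graphs; the names `source`, `program`, and `run` are chosen here
for illustration.

```python
# Step 1: define the computation as an expression; nothing is evaluated yet.
source = '(a + b) + (c + d)'

# Step 2: compile the definition into an executable code object.
program = compile(source, '<symbolic>', 'eval')


# Step 3: provide the required inputs and call the compiled program.
def run(a, b, c, d):
    return eval(program, {}, {'a': a, 'b': b, 'c': c, 'd': d})


print(run(1, 2, 3, 4))
```

Because the whole expression is known before any input arrives, the compile step is the place
where a real framework gets the chance to rewrite or simplify the computation.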


This allows for a significant amount of optimization. First off, we can skip the Python interpreter 
in many cases, thus removing a performance bottleneck that can become significant on multiple 
fast GPUs paired with a single Python thread on a CPU. Secondly, a compiler might optimize and 
rewrite the above code into print((1 + 2) + (3 + 4)) or even print(10). This is possible since a 
compiler gets to see the full code before turning it into machine instructions. For instance, it can 
release memory (or never allocate it) whenever a variable is no longer needed. Or it can transform 
the code entirely into an equivalent piece. To get a better idea consider the following simulation 
of imperative programming (it is Python after all) below. 


def add_():
    return '''
def add(a, b):
    return a + b
'''

def fancy_func_():
    return '''
def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g
'''

def evoke_():
    return add_() + fancy_func_() + 'print(fancy_func(1, 2, 3, 4))'

prog = evoke_()
print(prog)
y = compile(prog, '', 'exec')
exec(y)


def add(a, b):
    return a + b

def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g
print(fancy_func(1, 2, 3, 4))
10


The differences between imperative (interpreted) programming and symbolic programming are 
as follows: 


• Imperative programming is easier. When imperative programming is used in Python, the
  majority of the code is straightforward and easy to write. It is also easier to debug imperative
  programming code. This is because it is easier to obtain and print all relevant intermediate
  variable values, or use Python’s built-in debugging tools.

• Symbolic programming is more efficient and easier to port. It makes it easier to optimize
  the code during compilation, while also having the ability to port the program into a format
  independent of Python. This allows the program to be run in a non-Python environment,
  thus avoiding any potential performance issues related to the Python interpreter.


12.1.2 Hybrid Programming 


Historically most deep learning frameworks chose between an imperative and a symbolic
approach. For example, Theano, TensorFlow (inspired by the former), Keras and CNTK formulate
models symbolically. Conversely, Chainer and PyTorch take an imperative approach. An
imperative mode was added to TensorFlow 2.0 (via Eager) and Keras in later revisions.


When designing Gluon, developers considered whether it would be possible to combine the ben- 
efits of both programming models. This led to a hybrid model that lets users develop and de- 
bug using pure imperative programming, while having the ability to convert most programs into 
symbolic programs to be run when product-level computing performance and deployment are 
required. 


In practice this means that we build models using either the HybridBlock or the HybridSequential
and HybridConcurrent classes. By default, they are executed in the same way Block or Sequential
and Concurrent classes are executed in imperative programming. HybridSequential is a subclass
of HybridBlock (just like Sequential subclasses Block). When the hybridize function is called,
Gluon compiles the model into the form used in symbolic programming. This allows one to
optimize the compute-intensive components without sacrifices in the way a model is implemented.
We will illustrate the benefits below, focusing on sequential models and blocks only (the
concurrent composition works analogously).


12.1.3 HybridSequential 


The easiest way to get a feel for how hybridization works is to consider deep networks with mul- 
tiple layers. Conventionally the Python interpreter will need to execute the code for all layers to 
generate an instruction that can then be forwarded to a CPU or a GPU. For a single (fast) compute 
device this does not cause any major issues. On the other hand, if we use an advanced 8-GPU 
server such as an AWS P3dn.24xlarge instance Python will struggle to keep all GPUs busy. The 
single-threaded Python interpreter becomes the bottleneck here. Let us see how we can address 
this for significant parts of the code by replacing Sequential by HybridSequential. We begin by 
defining a simple MLP. 


from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn

npx.set_np()

# Factory for networks
def get_net():
    net = nn.HybridSequential()
    net.add(nn.Dense(256, activation='relu'),
            nn.Dense(128, activation='relu'),
            nn.Dense(2))
    net.initialize()
    return net

x = np.random.normal(size=(1, 512))
net = get_net()
net(x)


array([[ 0.16526186, -0.14005628]])


By calling the hybridize function, we are able to compile and optimize the computation in the 
MLP. The model's computation result remains unchanged. 


net.hybridize() 
net(x) 


array([[ 0.16526186, -0.14005628]])


This seems almost too good to be true: simply designate a block to be HybridSequential, write
the same code as before and invoke hybridize. Once this happens the network is optimized (we
will benchmark the performance below). Unfortunately this does not work magically for every
layer. That said, the blocks provided by Gluon are by default subclasses of HybridBlock and thus
hybridizable. A layer will not be optimized if it inherits from Block instead.







Acceleration by Hybridization 


To demonstrate the performance improvement gained by compilation we compare the time
needed to evaluate net(x) before and after hybridization. Let us define a function to measure
this time first. It will come in handy throughout the chapter as we set out to measure (and improve)
performance.


#@save
class Benchmark:
    def __init__(self, description='Done'):
        self.description = description

    def __enter__(self):
        self.timer = d2l.Timer()
        return self

    def __exit__(self, *args):
        print(f'{self.description}: {self.timer.stop():.4f} sec')


Now we can invoke the network twice, once with and once without hybridization. 


net = get_net()
with Benchmark('Without hybridization'):
    for i in range(1000): net(x)
    npx.waitall()

net.hybridize()
with Benchmark('With hybridization'):
    for i in range(1000): net(x)
    npx.waitall()


Without hybridization: 0.8061 sec 
With hybridization: 0.2216 sec 


As is observed in the above results, after a HybridSequential instance calls the hybridize function,
computing performance is improved through the use of symbolic programming.
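
The timing pattern used by Benchmark does not depend on MXNet. As a sketch, the same context
manager can be written with only the standard library; the class name `SimpleBenchmark` and the
`elapsed` attribute are assumptions introduced here for illustration.

```python
import time


class SimpleBenchmark:
    """Hypothetical stand-in for Benchmark that needs only the standard library."""

    def __init__(self, description='Done'):
        self.description = description
        self.elapsed = None

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        # Record and report the wall-clock time spent inside the with block.
        self.elapsed = time.perf_counter() - self.start
        print(f'{self.description}: {self.elapsed:.4f} sec')


with SimpleBenchmark('Summing one million integers') as bench:
    total = sum(range(1_000_000))
```

When timing asynchronous frameworks, the work must still be forced to finish inside the with
block, which is the role npx.waitall() plays in the measurements above.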


Serialization 


One of the benefits of compiling the models is that we can serialize (save) the model and its
parameters to disk. This allows us to store a model in a manner that is independent of the
front-end language of choice, to deploy trained models to other devices, and to easily use other
front-end programming languages. At the same time the code is often faster than what can be
achieved in imperative programming. Let us see the export method in action.


net.export('my_mlp')
!ls -lh my_mlp*


-rw-r--r-- 1 jenkins jenkins 643K Jan 18 05:38 my_mlp-0000.params 
-rw-r--r-- 1 jenkins jenkins 3.0K Jan 18 05:38 my_mlp-symbol.json
The model is decomposed into a (large binary) parameter file and a JSON description of the
program required to compute the model. The files can be read by other front-end languages
supported by Python or MXNet, such as C++, R, Scala, and Perl. Let us have a look at the model
description.


!head my_mlp-symbol.json


{
  "nodes": [
    {
      "op": "null",
      "name": "data",
      "inputs": []
    },
    {
      "op": "null",
      "name": "dense3_weight",

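Since the description is ordinary JSON, any JSON parser can inspect it. The sketch below uses a small hand-written fragment in the same shape as the excerpt above rather than a real exported file; the fragment is hypothetical (a real file has many more nodes and fields), and only the standard json module is used.

```python
import json

# Hypothetical, hand-written fragment mirroring the two nodes visible above;
# it is NOT the content of a real exported file.
symbol_json = '''
{
  "nodes": [
    {"op": "null", "name": "data", "inputs": []},
    {"op": "null", "name": "dense3_weight", "inputs": []}
  ]
}
'''

graph = json.loads(symbol_json)
# Collect the node names in the order they appear in the graph description
node_names = [node['name'] for node in graph['nodes']]
print(node_names)
```

To inspect an actual export, one would read the file written by net.export instead of the inline string.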

Things are slightly trickier when it comes to models that resemble code more closely. Basically,
hybridization needs to deal with control flow and Python overhead in a much more immediate
manner. Moreover, contrary to the Block instance, which needs to use the forward function, for a
HybridBlock instance we need to use the hybrid_forward function.


Earlier, we demonstrated that, after calling the hybridize method, the model is able to achieve
superior computing performance and portability. Note, though, that hybridization can affect model
flexibility, in particular in terms of control flow. We will illustrate how to design more general
models and also how compilation will remove spurious Python elements.


class HybridNet(nn.HybridBlock):
    def __init__(self, **kwargs):
        super(HybridNet, self).__init__(**kwargs)
        self.hidden = nn.Dense(4)
        self.output = nn.Dense(2)

    def hybrid_forward(self, F, x):
        # Print the module and the data to see what flows through the network
        print('module F: ', F)
        print('value x: ', x)
        x = F.npx.relu(self.hidden(x))
        print('result : ', x)
        return self.output(x)


The code above implements a simple network with 4 hidden units and 2 outputs. hybrid_forward
takes an additional argument: the module F. This is needed since, depending on whether the code
has been hybridized or not, it will use a slightly different library (ndarray or symbol) for processing.
Both modules perform very similar functions and MXNet automatically determines the argument.
To understand what is going on we print the arguments as part of the function invocation.


net = HybridNet() 
net.initialize() 

x = np.random.normal(size=(1, 3)) 
net(x)
module F:  <module 'mxnet.ndarray' from '/var/lib/jenkins/miniconda3/envs/d2l-en-release-1/lib/python3.8/site-packages/mxnet/ndarray/__init__.py'>
value x:  [[-0.6338663   0.40156594  0.46456942]]
result :  [[0.01641375 0.         0.         0.        ]]

array([[0.00097611, 0.00019453]])


Repeating the forward computation will lead to the same output (we omit details). Now let us see 
what happens if we invoke the hybridize method. 


net.hybridize() 
net(x) 


module F:  <module 'mxnet.symbol' from '/var/lib/jenkins/miniconda3/envs/d2l-en-release-1/lib/python3.8/site-packages/mxnet/symbol/__init__.py'>
value x:  <_Symbol data>
result :  <_Symbol hybridnet0_relu0>

array([[0.00097611, 0.00019453]])


Instead of using ndarray we now use the symbol module for F. Moreover, even though the input is 
of ndarray type, the data flowing through the network is now converted to symbol type as part of 
the compilation process. Repeating the function call leads to a surprising outcome: 


net(x) 


array([[0.00097611, 0.00019453]])


This is quite different from what we saw previously. All print statements, as defined in
hybrid_forward, are omitted. Indeed, after hybridization the execution of net(x) does not involve
the Python interpreter any longer. This means that any spurious Python code is omitted (such
as print statements) in favor of a much more streamlined execution and better performance.
Instead, MXNet directly calls the C++ backend. Also note that some functions are not supported in
the symbol module (like asnumpy) and in-place operations such as a += b and a[:] = a + b must
be rewritten as a = a + b. Nonetheless, compilation of models is worth the effort whenever
speed matters. The benefit can range from small percentage points to more than twice the speed,
depending on the complexity of the model, the speed of the CPU, and the speed and number of
GPUs.
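Why the print statements disappear can be illustrated with a toy "trace once, replay afterwards" mechanism in plain Python. This is only an analogy under simplifying assumptions, not how MXNet is implemented: Recorder, toy_hybridize and forward are hypothetical names, and only multiplication and addition by constants are recorded.

```python
class Recorder:
    """Placeholder input that records arithmetic instead of performing it."""
    def __init__(self):
        self.ops = []

    def __mul__(self, other):
        self.ops.append(('mul', other))
        return self

    def __add__(self, other):
        self.ops.append(('add', other))
        return self

def toy_hybridize(fn):
    """Trace fn once with a Recorder, then replay the recorded operations."""
    ops = None

    def compiled(x):
        nonlocal ops
        if ops is None:            # first call: run the Python body exactly once
            recorder = Recorder()
            fn(recorder)
            ops = recorder.ops
        for name, arg in ops:      # every call: replay without the Python body
            x = x * arg if name == 'mul' else x + arg
        return x

    return compiled

python_calls = []

@toy_hybridize
def forward(x):
    python_calls.append('body executed')   # plays the role of a print statement
    return x * 2 + 1

first = forward(3)
second = forward(5)
print(first, second, len(python_calls))
```

The Python body runs only during tracing; later calls reuse the recorded operations. The same picture suggests why data-dependent Python control flow is problematic: a branch taken during tracing would be baked into the recorded program.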


Summary 


* Imperative programming makes it easy to design new models since it is possible to write
  code with control flow and the ability to use a large amount of the Python software ecosystem.

* Symbolic programming requires that we specify the program and compile it before executing
  it. The benefit is improved performance.

* MXNet is able to combine the advantages of both approaches as needed.

* Models constructed by the HybridSequential and HybridBlock classes are able to convert
  imperative programs into symbolic programs by calling the hybridize method.


Exercises 


1. Design a network using the HybridConcurrent class. Alternatively look at Networks with Par- 
allel Concatenations (GoogLeNet) (page 272) for a network to compose. 


2. Add x.asnumpy() to the first line of the hybrid_forward function of the HybridNet class in 
this section. Execute the code and observe the errors you encounter. Why do they happen? 


3. What happens if we add control flow, i.e., the Python statements if and for in the
hybrid_forward function?


4. Review the models that interest you in the previous chapters and use the HybridBlock class 
or HybridSequential class to implement them. 


Discussions [147]


12.2 Asynchronous Computation 


Today's computers are highly parallel systems, consisting of multiple CPU cores (often multiple
threads per core), multiple processing elements per GPU and often multiple GPUs per device. In
short, we can process many different things at the same time, often on different devices.
Unfortunately Python is not a great way of writing parallel and asynchronous code, at least not
without some extra help. After all, Python is single-threaded and this is unlikely to change in the
future. Deep learning frameworks such as MXNet and TensorFlow utilize an asynchronous programming
model to improve performance (PyTorch uses Python's own scheduler leading to a different
performance trade-off). For PyTorch, by default, GPU operations are asynchronous. When you call
a function that uses the GPU, the operations are enqueued to the particular device, but not
necessarily executed until later. This allows us to execute more computations in parallel, including
operations on CPU or other GPUs.


Hence, understanding how asynchronous programming works helps us to develop more efficient 
programs, by proactively reducing computational requirements and mutual dependencies. This 
allows us to reduce memory overhead and increase processor utilization. We begin by importing 
the necessary libraries. 


from d2l import mxnet as d2l
import numpy, os, subprocess
from mxnet import autograd, gluon, np, npx
from mxnet.gluon import nn
npx.set_np()





147 https://discuss.d2l.ai/t/360
12.2.1 Asynchrony via Backend


As a warm-up, consider the following toy problem: we want to generate a random matrix and
multiply it with itself. Let us do that both in NumPy and in MXNet NP to see the difference.


with d2l.Benchmark('numpy'):
    for _ in range(10):
        a = numpy.random.normal(size=(1000, 1000))
        b = numpy.dot(a, a)

with d2l.Benchmark('mxnet.np'):
    for _ in range(10):
        a = np.random.normal(size=(1000, 1000))
        b = np.dot(a, a)


numpy: 0.8547 sec 
mxnet.np: 0.0046 sec 


This is orders of magnitude faster. At least it seems to be so. Since both are executed on the same 
processor something else must be going on. Forcing MXNet to finish all computation prior to 
returning shows what happened previously: computation is being executed by the backend while 
the frontend returns control to Python. 


with d2l.Benchmark():
    for _ in range(10):
        a = np.random.normal(size=(1000, 1000))
        b = np.dot(a, a)
    npx.waitall()


Done: 0.8341 sec 


Broadly speaking, MXNet has a frontend for direct interaction with the users, e.g., via Python,
as well as a backend used by the system to perform the computation. As shown in Fig. 12.2.1,
users can write MXNet programs in various frontend languages, such as Python, R, Scala, and C++.
Regardless of the frontend programming language used, the execution of MXNet programs occurs
primarily in the backend of C++ implementations. Operations issued by the frontend language are
passed on to the backend for execution. The backend manages its own threads that continuously
collect and execute queued tasks. Note that for this to work the backend must be able to keep track
of the dependencies between various steps in the computational graph. Hence, it is not possible
to parallelize operations that depend on each other.
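The division of labor between frontend and backend can be mimicked with a queue and a worker thread from the Python standard library. This is a rough analogy only: the names tasks, results and backend_worker are hypothetical, a single worker processes tasks in order, and MXNet's real engine additionally tracks dependencies between operations.

```python
import queue
import threading

tasks = queue.Queue()   # stands in for the backend's task queue
results = []

def backend_worker():
    """Backend thread: repeatedly take a queued task and execute it."""
    while True:
        task = tasks.get()
        if task is None:           # sentinel: shut the worker down
            tasks.task_done()
            break
        results.append(task())     # the actual computation happens here
        tasks.task_done()

worker = threading.Thread(target=backend_worker)
worker.start()

# Frontend: enqueueing returns immediately; nothing is computed in this thread.
for i in range(5):
    tasks.put(lambda i=i: i * i)

tasks.join()      # barrier, loosely analogous to npx.waitall()
tasks.put(None)
worker.join()
print(results)
```

The frontend loop finishes as soon as the tasks are queued; only the explicit barrier makes it wait for the results.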
Fig. 12.2.1: Programming Frontends.


Let us look at another toy example to understand the dependency graph a bit better. 


x = np.ones((1, 2))
y = np.ones((1, 2))
z = x * y + 2
z


array([[3., 3.]]) 




Fig. 12.2.2: Dependencies. 


The code snippet above is also illustrated in Fig. 12.2.2. Whenever the Python frontend thread
executes one of the first three statements, it simply returns the task to the backend queue. When
the last statement's results need to be printed, the Python frontend thread will wait for the C++
backend thread to finish computing the result of the variable z. One benefit of this design is that
the Python frontend thread does not need to perform actual computations. Thus, there is little
impact on the program's overall performance, regardless of Python's performance. Fig. 12.2.3
illustrates how frontend and backend interact.




Fig. 12.2.3: Frontend and Backend. 
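The dependency structure of the toy example (two independent inputs, one result that needs both) can be imitated with futures from the standard library. This is a sketch under stated assumptions: plain Python lists replace MXNet arrays, the helper ones is hypothetical, and a thread pool plays the role of the backend.

```python
from concurrent.futures import ThreadPoolExecutor

def ones():
    # Stand-in for np.ones((1, 2))
    return [1.0, 1.0]

with ThreadPoolExecutor(max_workers=3) as pool:
    fx = pool.submit(ones)     # independent of y
    fy = pool.submit(ones)     # independent of x
    # z depends on both inputs, so this task first waits for their results
    fz = pool.submit(
        lambda: [a * b + 2 for a, b in zip(fx.result(), fy.result())])
    z = fz.result()            # like printing z: block until the value exists

print(z)
```

Submitting a task returns immediately; only asking for the value of z forces the frontend to wait.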


12.2.2 Barriers and Blockers 


There are a number of operations that will force Python to wait for completion:

* Most obviously, npx.waitall() waits until all computation has completed, regardless of when the
  compute instructions were issued. In practice it is a bad idea to use this operator unless
  absolutely necessary since it can lead to poor performance.

* If we just want to wait until a specific variable is available we can call z.wait_to_read(). In
  this case MXNet blocks the return to Python until the variable z has been computed. Other
  computation may well continue afterwards.
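The difference between waiting for everything and waiting for one specific result can be sketched with standard-library futures. This is an analogy, not MXNet code: work is a hypothetical helper, time.sleep stands in for backend computation, and the durations are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def work(seconds, value):
    time.sleep(seconds)   # stand-in for a backend computation
    return value

with ThreadPoolExecutor(max_workers=2) as pool:
    slow = pool.submit(work, 0.5, 'slow')
    fast = pool.submit(work, 0.01, 'fast')

    fast_value = fast.result()       # wait for one specific result only
    slow_done_early = slow.done()    # the other task is still running

    wait([slow, fast])               # wait for everything that was issued
    slow_done_late = slow.done()

print(fast_value, slow_done_early, slow_done_late)
```

Waiting on a single future lets unrelated work continue in the background, which is the behavior described for wait_to_read.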


Let us see how this works in practice: 


with d2l.Benchmark('waitall'):
    b = np.dot(a, a)
    npx.waitall()

with d2l.Benchmark('wait_to_read'):
    b = np.dot(a, a)
    b.wait_to_read()


waitall: 0.0048 sec 
wait_to_read: 0.0045 sec 


Both operations take approximately the same time to complete. Besides the obvious blocking
operations we recommend that the reader be aware of implicit blockers. Printing a variable clearly
requires the variable to be available and is thus a blocker. Lastly, conversions to NumPy via
z.asnumpy() and conversions to scalars via z.item() are blocking, since NumPy has no notion of
asynchrony. It needs access to the values just like the print function. Copying small amounts
of data frequently from MXNet's scope to NumPy and back can destroy the performance of
otherwise efficient code, since each such operation requires the computational graph to evaluate
all intermediate results needed to get the relevant term before anything else can be done.


with d2l.Benchmark('numpy conversion'):
    b = np.dot(a, a)
    b.asnumpy()

with d2l.Benchmark('scalar conversion'):
    b = np.dot(a, a)
    b.sum().item()
numpy conversion: 0.0069 sec
scalar conversion: 0.0144 sec


12.2.3 Improving Computation 


On a heavily multithreaded system (even regular laptops have 4 threads or more and on
multi-socket servers this number can exceed 256) the overhead of scheduling operations can become
significant. This is why it is highly desirable to have computation and scheduling occur
asynchronously and in parallel. To illustrate the benefit of doing this let us see what happens if we
increment a variable by 1 multiple times, either in sequence or asynchronously. We simulate
synchronous execution by inserting a wait_to_read() barrier in between each addition.


with d2l.Benchmark('synchronous'):
    for _ in range(1000):
        y = x + 1
        y.wait_to_read()  # Barrier after every single addition

with d2l.Benchmark('asynchronous'):
    for _ in range(1000):
        y = x + 1
    y.wait_to_read()  # Wait only once, after all additions were issued


synchronous: 0.1041 sec 
asynchronous: 0.0934 sec 


A slightly simplified interaction between the Python frontend thread and the C++ backend thread 
can be summarized as follows: 


1. The frontend orders the backend to insert the calculation task y = x + 1 into the queue. 


2. The backend then receives the computation tasks from the queue and performs the actual 
computations. 


3. The backend then returns the computation results to the frontend. 


Assume that the durations of these three stages are $t_1$, $t_2$ and $t_3$, respectively. If we do not
use asynchronous programming, the total time taken to perform 1000 computations is approximately
$1000 (t_1 + t_2 + t_3)$. If asynchronous programming is used, the total time taken to perform 1000
computations can be reduced to $t_1 + 1000 t_2 + t_3$ (assuming $1000 t_2 > 999 t_1$), since the
frontend does not have to wait for the backend to return computation results for each loop.
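To get a feeling for the two expressions, one can plug in concrete stage durations. The values below are purely hypothetical (chosen in microseconds so that the arithmetic stays exact) and are not measurements from this section.

```python
# Hypothetical stage durations in microseconds (not measurements)
t1, t2, t3 = 20, 50, 30
n = 1000

# Synchronous: every iteration pays for all three stages
synchronous = n * (t1 + t2 + t3)

# Asynchronous: enqueueing overlaps with computation, provided the backend
# work dominates, i.e. 1000 * t2 > 999 * t1
assert n * t2 > (n - 1) * t1
asynchronous = t1 + n * t2 + t3

print(synchronous, asynchronous)
```

With these numbers the asynchronous variant needs roughly half the time, because the frontend and return stages are paid only once instead of 1000 times.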


12.2.4 Improving Memory Footprint 


Imagine a situation where we keep on inserting operations into the backend by executing Python 
code on the frontend. For instance, the frontend might insert a large number of minibatch tasks 
within a very short time. After all, if no meaningful computation happens in Python this can 
be done quite quickly. If each of these tasks can be launched quickly at the same time this may 
cause a spike in memory usage. Given a finite amount of memory available on GPUs (and even 
on CPUs) this can lead to resource contention or even program crashes. Some readers might have 
noticed that previous training routines made use of synchronization methods such as item or even 
asnumpy. 
We recommend using these operations carefully, e.g., once for each minibatch, so as to balance
computational efficiency and memory footprint. To illustrate what happens let us implement a
simple training loop for a deep network and measure its memory consumption and timing. Below
is the mock data generator and deep network.


def data_iter():
    timer = d2l.Timer()
    num_batches, batch_size = 150, 1024
    for i in range(num_batches):
        X = np.random.normal(size=(batch_size, 512))
        y = np.ones((batch_size,))
        yield X, y
        if (i + 1) % 50 == 0:
            print(f'batch {i + 1}, time {timer.stop():.4f} sec')


net = nn.Sequential()
net.add(nn.Dense(2048, activation='relu'),
        nn.Dense(512, activation='relu'), nn.Dense(1))
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd')
loss = gluon.loss.L2Loss()


Next we need a tool to measure the memory footprint of our code. We use a relatively primitive
ps call to accomplish this (note that the latter only works on Linux and macOS). For a much more
detailed analysis of what is going on here use, e.g., NVIDIA's Nsight [148] or Intel's VTune [149].


def get_mem():
    # Parse the resident set size (in KB) from the `ps` output and return MB
    res = subprocess.check_output(['ps', 'u', '-p', str(os.getpid())])
    return int(str(res).split()[15]) / 1e3
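If calling ps is not an option, a rough alternative on Unix systems is the standard resource module. Note that this is a different measurement from the one above: the hypothetical helper get_peak_mem_mb reports the peak resident set size of the process so far, not its current size, so it can only show growth, never a decrease.

```python
import resource
import sys

def get_peak_mem_mb():
    """Peak resident set size of this process in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS
    if sys.platform == 'darwin':
        return peak / 1e6
    return peak / 1e3

print(f'peak memory: {get_peak_mem_mb():.1f} MB')
```

For the experiments in this section the ps-based helper remains the better fit, since it can also register memory being released.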


Before we can begin testing we need to initialize the parameters of the network and process one 
batch. Otherwise it would be tricky to see what the additional memory consumption is. See Sec- 
tion 5.3 for further details related to initialization. 


for X, y in data_iter():
    break
loss(y, net(X)).wait_to_read()


To ensure that we do not overflow the task buffer on the backend we insert a wait_to_read call for 
the loss function at the end of each loop. This forces the forward propagation to complete before 
a new forward propagation is commenced. Note that a (possibly more elegant) alternative would 
have been to track the loss in a scalar variable and to force a barrier via the item call. 


mem = get_mem()
with d2l.Benchmark('time per epoch'):
    for X, y in data_iter():
        with autograd.record():
            l = loss(y, net(X))
        l.backward()
        trainer.step(X.shape[0])
        l.wait_to_read()  # Barrier before a new batch
    npx.waitall()
print(f'increased memory: {get_mem() - mem:f} MB')


batch 50, time 3.1852 sec
batch 100, time 6.2475 sec
batch 150, time 9.5391 sec
time per epoch: 9.5523 sec
increased memory: 24.284000 MB


148 https://developer.nvidia.com/nsight-compute-2019_5
149 https://software.intel.com/en-us/vtune


As we see, the timing of the minibatches lines up quite nicely with the overall runtime of the opti- 
mization code. Moreover, memory footprint only increases slightly. Now let us see what happens 
if we drop the barrier at the end of each minibatch. 


mem = get_mem()
with d2l.Benchmark('time per epoch'):
    for X, y in data_iter():
        with autograd.record():
            l = loss(y, net(X))
        l.backward()
        trainer.step(X.shape[0])
    npx.waitall()
print(f'increased memory: {get_mem() - mem:f} MB')


batch 50, time 0.1210 sec 
batch 100, time 0.2446 sec 
batch 150, time 0.3853 sec 
time per epoch: 9.8210 sec 
increased memory: -8.228000 MB 


Even though the time to issue instructions for the backend is an order of magnitude smaller, we
still need to perform the computation. Consequently a large amount of intermediate results cannot
be released and may pile up in memory. While this did not cause any issues in the toy example
above, it might well have resulted in out-of-memory situations when left unchecked in real-world
scenarios.
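The effect of a per-task barrier on the amount of pending work can be imitated with a queue and a worker thread. The sketch below is an analogy with hypothetical names (peak_queue_depth, backend_worker): queue depth stands in for memory held by pending minibatches, and time.sleep stands in for computation.

```python
import queue
import threading
import time

def peak_queue_depth(num_tasks, barrier_per_task):
    """Return the largest number of tasks waiting in the queue at once."""
    tasks = queue.Queue()

    def backend_worker():
        while True:
            task = tasks.get()
            if task is not None:
                time.sleep(0.005)      # stand-in for one minibatch of work
            tasks.task_done()
            if task is None:           # sentinel: stop the worker
                break

    worker = threading.Thread(target=backend_worker)
    worker.start()
    peak = 0
    for i in range(num_tasks):
        tasks.put(i)                   # the frontend issues work
        peak = max(peak, tasks.qsize())
        if barrier_per_task:
            tasks.join()               # loosely analogous to l.wait_to_read()
    tasks.put(None)
    worker.join()
    return peak

with_barrier = peak_queue_depth(20, barrier_per_task=True)
without_barrier = peak_queue_depth(20, barrier_per_task=False)
print(with_barrier, without_barrier)
```

With the barrier at most one task is ever pending, whereas without it the frontend runs far ahead and almost all tasks queue up at once.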


Summary 


* MXNet decouples the Python frontend from an execution backend. This allows for fast
  asynchronous insertion of commands into the backend and associated parallelism.

* Asynchrony leads to a rather responsive frontend. However, use caution not to overfill the
  task queue since it may lead to excessive memory consumption.

* It is recommended to synchronize for each minibatch to keep frontend and backend
  approximately synchronized.

* Be aware of the fact that conversions from MXNet's memory management to Python will
  force the backend to wait until the specific variable is ready. print, asnumpy and item all have
  this effect. This can be desirable but a careless use of synchronization can ruin performance.

* Chip vendors offer sophisticated performance analysis tools to obtain a much more
  fine-grained insight into the efficiency of deep learning.
Exercises


1. We mentioned above that using asynchronous computation can reduce the total amount of
time needed to perform 1000 computations to $t_1 + 1000 t_2 + t_3$. Why do we have to assume
$1000 t_2 > 999 t_1$ here?


2. How would you need to modify the training loop if you wanted to have an overlap of one
minibatch each? I.e., if you wanted to ensure that batch $b_i$ finishes before batch $b_{i+2}$
commences?


3. What might happen if we want to execute code on CPUs and GPUs simultaneously? Should 
you still insist on synchronizing after every minibatch has been issued? 


4. Measure the difference between waitall and wait_to_read. Hint: perform a number of 
instructions and synchronize for an intermediate result. 


Discussions [150]


12.3 Automatic Parallelism 


MXNet automatically constructs computational graphs at the backend. Using a computational 
graph, the system is aware of all the dependencies, and can selectively execute multiple non- 
interdependent tasks in parallel to improve speed. For instance, Fig. 12.2.2 in Section 12.2 ini- 
tializes two variables independently. Consequently the system can choose to execute them in 
parallel. 


Typically, a single operator will use all the computational resources on all CPUs or on a single
GPU. For example, the dot operator will use all cores (and threads) on all CPUs, even if there
are multiple CPU processors on a single machine. The same applies to a single GPU. Hence
parallelization is not quite so useful for single-device computers. With multiple devices things
matter more. While parallelization is typically most relevant between multiple GPUs, adding the
local CPU will increase performance slightly. See e.g., (Hadjis et al., 2016) for a paper that focuses
on training computer vision models combining a GPU and a CPU. With the convenience of an
automatically parallelizing framework we can accomplish the same goal in a few lines of Python code.
More broadly, our discussion of automatic parallel computation focuses on parallel computation
using both CPUs and GPUs, as well as the parallelization of computation and communication. We
begin by importing the required packages and modules. Note that we need at least two GPUs to
run the experiments in this section.


from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()





150 https://discuss.d2l.ai/t/361
12.3.1 Parallel Computation on GPUs


Let us start by defining a reference workload to test: the run function below performs 50
matrix-matrix multiplications on the device of our choosing using data allocated into two variables,
x_gpu1 and x_gpu2.


devices = d2l.try_all_gpus()
def run(x):
    return [x.dot(x) for _ in range(50)]

x_gpu1 = np.random.uniform(size=(4000, 4000), ctx=devices[0])
x_gpu2 = np.random.uniform(size=(4000, 4000), ctx=devices[1])


Now we apply the function to the data. To ensure that caching does not play a role in the results 
we warm up the devices by performing a single pass on each of them prior to measuring. 


run(x_gpu1)  # Warm-up both devices
run(x_gpu2)
npx.waitall()

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)
    npx.waitall()

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)
    npx.waitall()


GPU1 time: 0.4871 sec 
GPU2 time: 0.5000 sec 


If we remove the waitall() between both tasks the system is free to parallelize computation on 
both devices automatically. 


with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    npx.waitall()


GPU1 & GPU2: 0.5049 sec 


In the above case the total execution time is less than the sum of its parts, since MXNet auto- 
matically schedules computation on both GPU devices without the need for sophisticated code on 
behalf of the user. 
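The same effect can be reproduced without any GPU by letting two threads stand in for two devices. This is a sketch under simplifying assumptions: run_on_device is a hypothetical helper, time.sleep replaces the actual workload, and a thread pool replaces MXNet's scheduler.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_on_device(seconds):
    time.sleep(seconds)   # stand-in for a workload occupying one device

# One device after the other
start = time.perf_counter()
run_on_device(0.2)
run_on_device(0.2)
sequential = time.perf_counter() - start

# Both devices at once
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_on_device, 0.2) for _ in range(2)]
    for f in futures:
        f.result()
parallel = time.perf_counter() - start

print(f'sequential: {sequential:.2f} sec, parallel: {parallel:.2f} sec')
```

As in the measurement above, the combined time is close to that of a single workload rather than the sum of both.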
12.3.2 Parallel Computation and Communication


In many cases we need to move data between different devices, say between CPU and GPU, or be- 
tween different GPUs. This occurs e.g., when we want to perform distributed optimization where 
we need to aggregate the gradients over multiple accelerator cards. Let us simulate this by com- 
puting on the GPU and then copying the results back to the CPU. 


def copy_to_cpu(x):
    return [y.copyto(npx.cpu()) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    npx.waitall()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    npx.waitall()


Run on GPU1: 0.5193 sec 
Copy to CPU: 2.3719 sec 


This is somewhat inefficient. Note that we could already start copying parts of y to the CPU while 
the remainder of the list is still being computed. This situation occurs, e.g., when we compute the 
(backprop) gradient on a minibatch. The gradients of some of the parameters will be available 
earlier than that of others. Hence it works to our advantage to start using PCI-Express bus band- 
width while the GPU is still running. Removing waitall between both parts allows us to simulate 
this scenario. 


with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y)
    npx.waitall()


Run on GPU1 and copy to CPU: 2.5454 sec 


The total time required for both operations is (as expected) less than the sum of their parts
(2.5454 sec compared to 0.5193 + 2.3719 = 2.8912 sec). Note that this task is different from
parallel computation as it uses a different resource: the bus between CPU and GPUs. In fact, we
could compute on both devices and communicate, all at the same time. As noted above, there is a
dependency between computation and communication: y[i] must be computed before it can be
copied to the CPU. Fortunately, the system can copy y[i-1] while computing y[i] to reduce the
total running time.
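A simple timing model shows how much overlap can help at best. The sketch below derives per-item costs from the two separate timings reported above, assuming (as a simplification) that each of the 50 results costs the same to compute and to copy; the function names are hypothetical.

```python
def sequential_time(n, compute, copy):
    """All computation first, then all copies."""
    return n * compute + n * copy

def overlapped_time(n, compute, copy):
    """Copy item i as soon as it is computed and the bus is free."""
    compute_done = 0.0
    copy_done = 0.0
    for _ in range(n):
        compute_done += compute                  # GPU finishes the next item
        # The copy can start only when the item exists and the bus is idle
        copy_done = max(compute_done, copy_done) + copy
    return copy_done

n = 50                    # number of results produced by run
compute = 0.5193 / n      # per-item cost from 'Run on GPU1'
copy = 2.3719 / n         # per-item cost from 'Copy to CPU'

print(f'sequential: {sequential_time(n, compute, copy):.4f} sec')
print(f'overlapped: {overlapped_time(n, compute, copy):.4f} sec')
```

Because copying is the slower stage here, the idealized overlapped time is dominated by the copies plus the computation of the first item. The measured 2.5454 sec lies between this idealized figure and the sequential total.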


We conclude with an illustration of the computational graph and its dependencies for a simple
two-layer MLP when training on a CPU and two GPUs, as depicted in Fig. 12.3.1. It would be quite
painful to schedule the parallel program resulting from this manually. This is where it is
advantageous to have a graph-based compute backend for optimization.
[Figure: computational graph. Each GPU receives a slice of the minibatch and runs the forward
passes (FullcForward), the loss gradient (LossGrad) and the backward passes (FullcBackward). The
CPU sums the weight gradients from gpu0 and gpu1, updates fc1_weight and fc2_weight, and copies
the updated weights back to both GPUs.]


Fig. 12.3.1: Two layer MLP on a CPU and 2 GPUs.


Summary 


* Modern systems have a variety of devices, such as multiple GPUs and CPUs. They can be
  used in parallel, asynchronously.

* Modern systems also have a variety of resources for communication, such as PCI Express,
  storage (typically SSD or via network), and network bandwidth. They can be used in parallel
  for peak efficiency.

* The backend can improve performance through automatic parallel computation
  and communication.
Exercises


1. 50 operations were performed in the run function defined in this section. There are no
dependencies between them. Design an experiment to see if MXNet will automatically execute
them in parallel.


2. When the workload of an individual operator is sufficiently small, parallelization can help
even on a single CPU or GPU. Design an experiment to verify this.


3. Design an experiment that uses parallel computation on CPU, GPU and communication
between both devices.


4. Use a debugger such as NVIDIA's Nsight to verify that your code is efficient.


5. Design computation tasks that include more complex data dependencies, and run
experiments to see if you can obtain the correct results while improving performance.


Discussions [151]


12.4 Hardware 


Building systems with great performance requires a good understanding of the algorithms and 
models to capture the statistical aspects of the problem. At the same time it is also indispensable 
to have at least a modicum of knowledge of the underlying hardware. The current section is no 
substitute for a proper course on hardware and systems design. Instead, it might serve as a starting 
point for understanding why some algorithms are more efficient than others and how to achieve 
good throughput. Good design can easily make a difference of an order of magnitude and, in turn, 
this can make the difference between being able to train a network (e.g., in a week) or not at all (in 
3 months, thus missing the deadline). We will start by looking at computers. Then we will zoom in
to look more carefully at CPUs and GPUs. Lastly we zoom out to review how multiple computers
are connected in a server center or in the cloud. This is not a GPU purchase guide; for that, see
Section 19.5. An introduction to cloud computing with AWS can be found in Section 19.3.


Impatient readers may be able to get by with Fig. 12.4.1. It is taken from Colin Scott's interactive
post [152], which gives a good overview of the progress over the past decade. The original numbers
are due to Jeff Dean's Stanford talk from 2010 [153]. The discussion below explains some of the
rationale for these numbers and how they can guide us in designing algorithms. It is very high
level and cursory. It is clearly no substitute for a proper course but rather just meant to
provide enough information for a statistical modeler to make suitable design decisions. For an
in-depth overview of computer architecture we refer the reader to (Hennessy & Patterson, 2011)
or a recent course on the subject, such as the one by Krste Asanović [154].





151 https://discuss.d2l.ai/t/362
152 https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
153 https://static.googleusercontent.com/media/research.google.com/en//people/jeff/Stanford-DL-Nov-2010.pdf
154 http://inst.eecs.berkeley.edu/~cs152/sp19/
Fig. 12.4.1: Latency Numbers every Programmer should know.


12.4.1 Computers 


Most deep learning researchers have access to a computer with a fair amount of memory, com- 
pute, some form of an accelerator such as a GPU, or multiples thereof. It consists of several key 
components: 


* A processor, also referred to as CPU, which is able to execute the programs we give it (in
  addition to running an operating system and many other things), typically consisting of 8 or
  more cores.

* Memory (RAM) to store and retrieve the results from computation, such as weight vectors,
  activations, often training data.

* An Ethernet network connection (sometimes multiple) with speeds ranging from 1 Gbit/s to
  100 Gbit/s (on high end servers more advanced interconnects can be found).

* A high speed expansion bus (PCIe) to connect the system to one or more GPUs. Servers have
  up to 8 accelerators, often connected in an advanced topology; desktop systems have 1-2,
  depending on the budget of the user and the size of the power supply.

* Durable storage, such as a magnetic hard drive (HDD) or solid state drive (SSD), in many cases
  connected using the PCIe bus, provides efficient transfer of training data to the system and
  storage of intermediate checkpoints as needed.








Fig. 12.4.2: Connectivity of components 


As Fig. 12.4.2 indicates, most components (network, GPU, storage) are connected to the CPU across
the PCI Express bus. It consists of multiple lanes that are directly attached to the CPU. For instance
AMD's Threadripper 3 has 64 PCIe 4.0 lanes, each of which is capable of 16 Gbit/s data transfer in
both directions. The memory is directly attached to the CPU with a total bandwidth of up to 100 GB/s.
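The lane figures translate into aggregate bandwidth with a little arithmetic. The sketch below uses only the numbers quoted above and ignores protocol and encoding overhead; the 16-lane slot is an added illustrative assumption about how a single GPU is commonly attached.

```python
lanes = 64
gbit_per_lane = 16                    # Gbit/s per lane and direction, as quoted above

total_gbit = lanes * gbit_per_lane    # aggregate Gbit/s in each direction
total_gbyte = total_gbit / 8          # 8 bits per byte

# Assumption for illustration: a single GPU attached through 16 lanes
gpu_slot_gbyte = 16 * gbit_per_lane / 8

print(f'all lanes: {total_gbit} Gbit/s = {total_gbyte:.0f} GB/s per direction')
print(f'16-lane slot: {gpu_slot_gbyte:.0f} GB/s per direction')
```

Under these assumptions the bus as a whole is in the same range as the memory bandwidth quoted above, while a single device sees only a fraction of it.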
When we run code on a computer we need to shuffle data to the processors (CPU or GPU), perform
computation and then move the results off the processor back to RAM and durable storage. Hence,
in order to get good performance we need to make sure that this works seamlessly without any
one of the systems becoming a major bottleneck. For instance, if we cannot load images quickly
enough the processor will not have any work to do. Likewise, if we cannot move matrices quickly
enough to the CPU (or GPU), its processing elements will starve. Finally, if we want to synchronize
multiple computers across the network, the latter should not slow down computation. One option
is to interleave communication and computation. Let us have a look at the various components in
more detail.


12.4.2 Memory 


At its most basic, memory is used to store data that needs to be readily accessible. At present CPU
RAM is typically of the DDR4¹⁵⁵ variety, offering 20-25 GB/s bandwidth per module. Each module
has a 64 bit wide bus. Typically pairs of memory modules are used to allow for multiple channels.
CPUs have between 2 and 4 memory channels, i.e., they have between 40 GB/s and 100 GB/s peak
memory bandwidth. Often there are two banks per channel. For instance, AMD's Zen 3 Threadripper
has 8 slots.


While these numbers are impressive, they only tell part of the story. When we want to
read a portion from memory we first need to tell the memory module where the information can
be found. That is, we first need to send the address to RAM. Once this is accomplished we can choose
to read just a single 64 bit record or a long sequence of records. The latter is called a burst read. In
a nutshell, sending an address to memory and setting up the transfer takes approximately 100 ns
(details depend on the specific timing coefficients of the memory chips used), whereas every subsequent
transfer takes only 0.2 ns. In short, the first read is 500 times as expensive as subsequent ones! At
100 ns per access, we could perform at most 10,000,000 random reads per second. This suggests that
we avoid random memory access as far as possible and use burst reads (and writes) instead.
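

The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constants are the approximate figures quoted in the text, and the helper name `read_time_ns` is ours, introduced purely for illustration.

```python
# Approximate timing figures quoted in the text; real DRAM timings vary by module.
SETUP_NS = 100.0  # send the address and set up the transfer
BURST_NS = 0.2    # each subsequent 64-bit transfer within a burst


def read_time_ns(num_words, burst=True):
    """Estimate the time to read num_words 64-bit records.

    burst=True: one address setup, then consecutive transfers.
    burst=False: every record is a separate random read.
    """
    if burst:
        return SETUP_NS + (num_words - 1) * BURST_NS
    return num_words * SETUP_NS


# Cost ratio between the first read and each subsequent one.
first_vs_subsequent = SETUP_NS / BURST_NS

# Upper bound on random reads per second for a single bank.
random_reads_per_second = 1e9 / SETUP_NS

print(first_vs_subsequent, random_reads_per_second)
print(read_time_ns(1024, burst=True), read_time_ns(1024, burst=False))
```

Under these assumptions, reading 1,024 consecutive records as one burst costs only a few setup times' worth of latency, while reading the same records at random addresses pays the setup cost 1,024 times.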


Matters are a bit more complex when we take into account that we have multiple banks. Each bank
can read memory largely independently. This means two things. On the one hand, the effective
number of random reads is up to 4x higher, provided that they are spread evenly across memory.
It also means that it is still a bad idea to perform random reads since burst reads are 4x faster, too.
On the other hand, due to memory alignment to 64 bit boundaries it is a good idea to align any data
structures with the same boundaries. Compilers do this pretty much automatically¹⁵⁶ when the
appropriate flags are set. Curious readers are encouraged to review a lecture on DRAMs such as
the one by Zeshan Chishti¹⁵⁷.


GPU memory is subject to even higher bandwidth requirements since GPUs have many more
processing elements than CPUs. By and large there are two options to address them. One is to make
the memory bus significantly wider. For instance, NVIDIA's RTX 2080 Ti has a 352 bit wide bus.
This allows for much more information to be transferred at the same time. The other is to use
specific high-performance memory. Consumer grade devices, such as NVIDIA's RTX and Titan
series, typically use GDDR6¹⁵⁸ chips with over 500 GB/s aggregate bandwidth. An alternative is to
use HBM (high bandwidth memory) modules. They use a very different interface and connect
directly with GPUs on a dedicated silicon wafer. This makes them very expensive and their use is
typically limited to high-end server chips, such as the NVIDIA Volta V100 series of accelerators.
Quite unsurprisingly, GPU memory is much smaller than CPU memory due to its higher cost. For





155 https://en.wikipedia.org/wiki/DDR4_SDRAM 

156 https://en.wikipedia.org/wiki/Data_structure_alignment 
157 http://web.cecs.pdx.edu/~zeshan/ece585_lec5.pdf

158 https://en.wikipedia.org/wiki/GDDR6_SDRAM 
our purposes, by and large their performance characteristics are similar, just a lot faster. We can
safely ignore the details for the purpose of this book. They only matter when tuning GPU kernels
for high throughput.


12.4.3 Storage 


We saw that some of the key characteristics of RAM were bandwidth and latency. The same is true 
for storage devices, just that the differences can be even more extreme. 


Hard Disks have been in use for over half a century. In a nutshell they contain a number of
spinning platters with heads that can be positioned to read / write at any given track. High-end
disks hold up to 16 TB on 9 platters. One of the key benefits of HDDs is that they are relatively
inexpensive. Among their many downsides are their typically catastrophic failure modes and their
relatively high read latency.


To understand the latter, consider the fact that HDDs spin at around 7,200 RPM. If they were much
faster they would shatter due to the centrifugal force exerted on the platters. This has a major
downside when it comes to accessing a specific sector on the disk: we need to wait until the platter
has rotated into position (we can move the heads but not accelerate the actual disks). Hence it can
take over 8 ms until the requested data is available. A common way to express this is to say that
HDDs can operate at approximately 100 IOPs. This number has essentially remained unchanged
for the past two decades. Worse still, it is equally difficult to increase bandwidth (it is in the order
of 100-200 MB/s). After all, each head reads a track of bits, hence the bit rate only scales with
the square root of the information density. As a result HDDs are quickly becoming relegated to
archival storage and low-grade storage for very large datasets.
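

To see where the figures of "over 8 ms" and "approximately 100 IOPs" come from, the following sketch derives both from the rotation speed alone. It assumes the worst case of one full rotation per access and ignores head movement; the function names are ours.

```python
def rotation_time_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def rotation_limited_iops(rpm):
    """Upper bound on random accesses per second if each one waits
    a full rotation (head seek time is ignored)."""
    return rpm / 60.0


print(rotation_time_ms(7200))       # a bit over 8 ms per rotation
print(rotation_limited_iops(7200))  # 120 rotations per second
```

Once seek time is added on top of the rotational delay, the bound of 120 accesses per second drops to roughly the 100 IOPs quoted above.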


Solid State Drives use Flash memory to store information persistently. This allows for much faster 
access to stored records. Modern SSDs can operate at 100,000 to 500,000 IOPs, i.e., up to 3 orders 
of magnitude faster than HDDs. Furthermore, their bandwidth can reach 1-3GB/s, i.e., one order 
of magnitude faster than HDDs. These improvements sound almost too good to be true. Indeed, 
they come with a number of caveats, due to the way SSDs are designed. 


• SSDs store information in blocks (256 KB or larger). They can only be written as a whole,
which takes significant time. Consequently bit-wise random writes on SSD have very poor
performance. Likewise, writing data in general takes significant time since the block has
to be read, erased and then rewritten with new information. By now SSD controllers and
firmware have developed algorithms to mitigate this. Nonetheless writes can be much
slower, in particular for QLC (quad level cell) SSDs. The key for improved performance is to
maintain a queue of operations, to prefer reads and to write in large blocks if possible.


• The memory cells in SSDs wear out relatively quickly (often already after a few thousand
writes). Wear-level protection algorithms are able to spread the degradation over many cells.
That said, it is not recommended to use SSDs for swap files or for large aggregations of log
files.


• Lastly, the massive increase in bandwidth has forced computer designers to attach SSDs
directly to the PCIe bus. The drives capable of handling this, referred to as NVMe (Non-Volatile
Memory Express), can use up to 4 PCIe lanes. This amounts to up to 8 GB/s on PCIe 4.0.
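

The 8 GB/s figure for NVMe drives follows directly from the per-lane rate of PCIe 4.0. A minimal sketch, assuming the 16 Gbit/s per lane quoted earlier in this section and ignoring encoding and protocol overhead (the function name is ours):

```python
PCIE4_GBIT_PER_LANE = 16  # per direction, as quoted in the text


def pcie_bandwidth_gbyte_s(lanes, gbit_per_lane=PCIE4_GBIT_PER_LANE):
    """Peak one-directional bandwidth in GB/s for a link with the given
    number of lanes (raw rate, no protocol overhead)."""
    return lanes * gbit_per_lane / 8


print(pcie_bandwidth_gbyte_s(4))   # NVMe SSD on 4 lanes
print(pcie_bandwidth_gbyte_s(16))  # a device on 16 lanes, e.g., a GPU
```

Real drives deliver less than this raw link rate, so the number is best read as a ceiling.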


Cloud Storage provides a configurable range of performance. That is, the assignment of storage
to virtual machines is dynamic, both in terms of quantity and in terms of speed, as chosen by the
user. We recommend that the user increase the provisioned number of IOPs whenever latency is
too high, e.g., during training with many small records.


12.4.4 CPUs


Central Processing Units (CPUs) are the centerpiece of any computer (as before we give a very 
high level description focusing primarily on what matters for efficient deep learning models). 
They consist of a number of key components: processor cores which are able to execute machine 
code, a bus connecting them (the specific topology differs significantly between processor mod- 
els, generations and vendors), and caches to allow for higher bandwidth and lower latency mem- 
ory access than what is possible by reads from main memory. Lastly, almost all modern CPUs 
contain vector processing units to aid with high performance linear algebra and convolutions, as 
they are common in media processing and machine learning. 


Fig. 12.4.3: Intel Skylake consumer quad-core CPU


Fig. 12.4.3 depicts an Intel Skylake consumer grade quad-core CPU. It has an integrated GPU, 
caches, and a ringbus connecting the four cores. Peripherals (Ethernet, WiFi, Bluetooth, SSD 
controller, USB, etc.) are either part of the chipset or directly attached (PCIe) to the CPU. 


Microarchitecture 


Each of the processor cores consists of a rather sophisticated set of components. While details
differ between generations and vendors, the basic functionality is pretty much standard. The
front end loads instructions and tries to predict which path will be taken (e.g., for control flow).
Instructions are then decoded from assembly code to microinstructions. Assembly code is often
not the lowest level code that a processor executes. Instead, complex instructions may be decoded
into a set of lower-level operations. These are then processed by the actual execution core.
Often the latter is capable of performing many operations simultaneously. For instance, the ARM
Cortex A77 core of Fig. 12.4.4 is able to perform up to 8 operations simultaneously.


Fig. 12.4.4: ARM Cortex A77 Microarchitecture Overview


This means that efficient programs might be able to perform more than one instruction per clock 
cycle, provided that they can be carried out independently. Not all units are created equal. Some 
specialize in integer instructions whereas others are optimized for floating point performance. 
To increase throughput, the processor might also follow multiple codepaths simultaneously in a 
branching instruction and then discard the results of the branches not taken. This is why branch 
prediction units matter (on the frontend) such that only the most promising paths are pursued. 


Vectorization 


Deep learning is extremely compute hungry. Hence, to make CPUs suitable for machine learning,
one needs to perform many operations in one clock cycle. This is achieved via vector units. They
have different names: on ARM they are called NEON, on x86 the latest generation is referred to as
AVX2 units. A common aspect is that they are able to perform SIMD (single instruction multiple
data) operations. Fig. 12.4.5 shows how 8 short integers can be added in one clock cycle on ARM.


Fig. 12.4.5: 128 bit NEON vectorization


Depending on architecture choices, such registers are up to 512 bit long, allowing for the
combination of up to 64 pairs of numbers. For instance, we might be multiplying two numbers and adding
them to a third, which is also known as a fused multiply-add. Intel’s OpenVino¹⁶⁰ uses these to
achieve respectable throughput for deep learning on server grade CPUs. Note, though, that this
number is entirely dwarfed by what GPUs are capable of achieving. For instance, NVIDIA’s RTX
2080 Ti has 4,352 CUDA cores, each of which is capable of processing such an operation at any
time.
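

The lane counts mentioned above (8 short integers in a 128 bit register, up to 64 pairs in a 512 bit register) follow from dividing the register width by the element width. The sketch below makes this explicit and emulates lane-wise addition in plain Python. It is purely illustrative: the function names are ours, and no actual SIMD instructions are issued.

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements that fit into one vector register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide register width")
    return register_bits // element_bits


def simd_add(a, b, lanes):
    """Add two equal-length lists chunk by chunk, counting one
    'instruction' per chunk of up to `lanes` elements."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out, instructions = [], 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return out, instructions


neon_lanes = simd_lanes(128, 16)  # 16-bit short integers on 128 bit NEON
wide_lanes = simd_lanes(512, 8)   # 8-bit values in a 512 bit register
result, count = simd_add(list(range(16)), [1] * 16, neon_lanes)
print(neon_lanes, wide_lanes, count)
```

With 8 lanes, adding 16 pairs of short integers takes 2 vector instructions instead of 16 scalar ones, which is the source of the throughput gain.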


Cache 


Consider the following situation: we have a modest CPU with 4 cores as depicted in Fig. 12.4.3
above, running at 2 GHz frequency. Moreover, let us assume that we have an IPC (instructions per
clock) count of 1 and that the units have AVX2 with 256 bit width enabled. Let us furthermore
assume that at least one of the registers used for AVX2 operations needs to be retrieved from
memory. This means that the CPU consumes 4 × 256 bit = 1 kbit (i.e., 128 bytes) of data per clock cycle.
Unless we are able to transfer $2 \cdot 10^9 \cdot 128 = 256 \cdot 10^9$ bytes to the processor per second
the processing elements are going to starve. Unfortunately the memory interface of such a chip only
supports 20-40 GB/s data transfer, i.e., one order of magnitude less. The fix is to avoid loading new
data from memory as far as possible and rather to cache it locally on the CPU. This is where caches
come in handy (see this Wikipedia article¹⁶¹ for a primer). Commonly the following names / concepts are used:


• Registers are strictly speaking not part of the cache. They help stage instructions. That said,
CPU registers are memory locations that a CPU can access at clock speed without any delay
penalty. CPUs have tens of registers. It is up to the compiler (or programmer) to use registers
efficiently. For instance the C programming language has a register keyword.


• L1 caches are the first line of defense against high memory bandwidth requirements. L1
caches are tiny (typical sizes might be 32-64 kB) and often split into data and instructions
caches. When data is found in the L1 cache access is very fast. If it cannot be found there,
the search progresses down the cache hierarchy.


• L2 caches are the next stop. Depending on architecture design and processor size they might
be exclusive. They might be accessible only by a given core or shared between multiple
cores. L2 caches are larger (typically 256-512 kB per core) and slower than L1. Furthermore,
to access something in L2 we first need to check to realize that the data is not in L1, which
adds a small amount of extra latency.


• L3 caches are shared between multiple cores and can be quite large. AMD’s Epyc 3 server
CPUs have a whopping 256 MB of cache spread across multiple chiplets. More typical
numbers are in the 4-8 MB range.
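

Returning to the estimate that motivated caches at the start of this subsection, the required memory bandwidth can be reproduced in a few lines. This is a sketch under the stated assumptions (4 cores, 2 GHz, IPC of 1, one 256 bit operand loaded from memory per instruction); the function name is ours.

```python
def required_bandwidth_bytes(cores, clock_hz, vector_bits, ipc=1):
    """Bytes per second the memory system must deliver if every vector
    instruction fetches one operand of vector_bits from memory."""
    return cores * ipc * clock_hz * vector_bits // 8


needed = required_bandwidth_bytes(cores=4, clock_hz=2_000_000_000, vector_bits=256)
print(needed)  # bytes per second

# Compare against the 20-40 GB/s that the memory interface supports.
for available_gb in (20, 40):
    print(available_gb, needed / (available_gb * 1e9))
```

The shortfall is between roughly 6x and 13x, which is the "one order of magnitude" gap that caches need to bridge.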


Predicting which memory elements will be needed next is one of the key optimization
parameters in chip design. For instance, it is advisable to traverse memory in a forward direction since
most caching algorithms will try to read ahead rather than backwards. Likewise, keeping memory
access patterns local is a good way of improving performance. Adding caches is a double-edged
sword. On one hand they ensure that the processor cores do not starve of data. At the same time
they increase chip size, using up area that otherwise could have been spent on increasing
processing power. Moreover, cache misses can be expensive. Consider the worst case scenario, depicted
in Fig. 12.4.6. A memory location is cached on processor 0 when a thread on processor 1 requests
the data. To obtain it, processor 0 needs to stop what it is doing, write the information back to





160 https://01.org/openvinotoolkit 
161 https://en.wikipedia.org/wiki/Cache_hierarchy 
main memory and then let processor 1 read it from memory. During this operation both
processors wait. Quite potentially such code runs more slowly on multiple processors when compared to
an efficient single-processor implementation. This is one more reason why there is a practical
limit to cache sizes (besides their physical size).


Fig. 12.4.6: False sharing (image courtesy of Intel)


12.4.5 GPUs and other Accelerators 


It is not an exaggeration to claim that deep learning would not have been successful without GPUs.
By the same token, it is quite reasonable to argue that GPU manufacturers’ fortunes have
increased significantly due to deep learning. This co-evolution of hardware and algorithms has
led to a situation where for better or worse deep learning is the preferable statistical modeling
paradigm. Hence it pays to understand the specific benefits that GPUs and related accelerators
such as the TPU (Jouppi et al., 2017) offer.


Of note is a distinction that is often made in practice: accelerators are optimized either for train- 
ing or inference. For the latter we only need to compute the forward propagation in a network. 
No storage of intermediate data is needed for backpropagation. Moreover, we may not need very 
precise computation (FP16 or INT8 typically suffice). On the other hand, during training all inter- 
mediate results need storing to compute gradients. Moreover, accumulating gradients requires 
higher precision to avoid numerical underflow (or overflow). This means that FP16 (or mixed 
precision with FP32) is the minimum required. All of this necessitates faster and larger memory 
(HBM2 vs. GDDR6) and more processing power. For instance, NVIDIA Turing¹⁶² T4 GPUs are
optimized for inference whereas the V100 GPUs are preferable for training.
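

The underflow and overflow issues mentioned above can be illustrated without any GPU. The sketch below uses the half-precision format code `"e"` of Python's standard `struct` module to round values to FP16. The gradient values are hypothetical, and Python floats (FP64) stand in for the higher-precision accumulator that mixed-precision training would keep in FP32.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def overflows_fp16(x):
    """Return True if x is too large in magnitude to be stored in FP16."""
    try:
        struct.pack("<e", x)
    except (OverflowError, struct.error):
        return True
    return False


grads = [1e-8] * 1000  # hypothetical tiny per-example gradients

# Storing each gradient and the running sum in FP16 loses everything:
# 1e-8 is below the smallest FP16 subnormal and rounds to zero.
fp16_sum = 0.0
for g in grads:
    fp16_sum = to_fp16(fp16_sum + to_fp16(g))

# Accumulating in higher precision and rounding once keeps the signal.
wide_sum = to_fp16(sum(grads))

print(fp16_sum, wide_sum)
print(overflows_fp16(65504.0), overflows_fp16(70000.0))
```

This is the reason why accumulation is commonly carried out at higher precision even when the individual multiplications use FP16.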


Recall Fig. 12.4.5. Adding vector units to a processor core allowed us to increase throughput
significantly (in the example in the figure we were able to perform 8 operations simultaneously).
What if we added operations that optimized not just operations between vectors but also between
matrices? This strategy led to Tensor Cores (more on this shortly). Secondly, what if we added
many more cores? In a nutshell, these two strategies summarize the design decisions in GPUs.
Fig. 12.4.7 gives an overview of a basic processing block. It contains 16 integer and 16 floating
point units. In addition to that, two Tensor Cores accelerate a narrow subset of additional
operations relevant for deep learning. Each Streaming Multiprocessor (SM) consists of four such
blocks.


162 https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/ 


Fig. 12.4.7: NVIDIA Turing Processing Block (image courtesy of NVIDIA)


12 streaming multiprocessors are then grouped into graphics processing clusters which make up 
the high-end TU102 processors. Ample memory channels and an L2 cache complement the setup. 
Fig. 12.4.8 has the relevant details. One of the reasons for designing such a device is that individual 
blocks can be added or removed as needed to allow for more compact chips and to deal with yield 
issues (faulty modules might not be activated). Fortunately programming such devices is well 
hidden from the casual deep learning researcher beneath layers of CUDA and framework code. 
In particular, more than one of the programs might well be executed simultaneously on the GPU, 
provided that there are available resources. Nonetheless it pays to be aware of the limitations of 
the devices to avoid picking models that do not fit into device memory. 





Fig. 12.4.8: NVIDIA Turing Architecture (image courtesy of NVIDIA) 


A last aspect that is worth mentioning in more detail is TensorCores. They are an example of
a recent trend of adding more optimized circuits that are specifically effective for deep learning.
For instance, the TPU added a systolic array (Kung, 1988) for fast matrix multiplication. There
the design was to support a very small number (one for the first generation of TPUs) of large
operations. TensorCores are at the other end. They are optimized for small operations involving
between 4x4 and 16x16 matrices, depending on their numerical precision. Fig. 12.4.9 gives an
overview of the optimizations.
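

To make the idea of "small operations" concrete, the following sketch decomposes a larger matrix product into 4x4 tile operations in plain Python. It assumes that the primitive is a fused multiply-accumulate of the form A·B + C on one tile, which is our simplification for illustration; the function names are ours and nothing here reflects how the hardware is actually programmed.

```python
TILE = 4  # tile edge length; the text cites 4x4 up to 16x16 depending on precision


def tile_mma(a, b, c):
    """Return a @ b + c for three TILE x TILE matrices (lists of lists)."""
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE)) for j in range(TILE)]
        for i in range(TILE)
    ]


def get_tile(m, row, col):
    """Extract the TILE x TILE block of m whose top-left corner is (row, col)."""
    return [r[col:col + TILE] for r in m[row:row + TILE]]


def tiled_matmul(a, b):
    """Multiply two n x n matrices (n a multiple of TILE) one tile at a time."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            acc = [[0.0] * TILE for _ in range(TILE)]
            for k in range(0, n, TILE):
                # Each call stands in for one small multiply-accumulate step.
                acc = tile_mma(get_tile(a, i, k), get_tile(b, k, j), acc)
            for r in range(TILE):
                out[i + r][j:j + TILE] = acc[r]
    return out


def naive_matmul(a, b):
    """Reference implementation used only to check the tiled version."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

For an 8x8 product the tiled version issues 8 tile operations. This also hints at why keeping layer dimensions aligned to the tile size matters: partially filled tiles still cost a full operation.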


Fig. 12.4.9: NVIDIA TensorCores in Turing (image courtesy of NVIDIA)


Obviously when optimizing for computation we end up making certain compromises. One of
them is that GPUs are not very good at handling interrupts and sparse data. While there are
notable exceptions, such as Gunrock¹⁶³ (Wang et al., 2016), the access pattern of sparse matrices and
vectors does not go well with the high bandwidth burst read operations where GPUs excel.
Matching both goals is an area of active research. See, e.g., DGL¹⁶⁴, a library tuned for deep learning on
graphs.


12.4.6 Networks and Buses 


Whenever a single device is insufficient for optimization we need to transfer data to and from it
to synchronize processing. This is where networks and buses come in handy. We have a number
of design parameters: bandwidth, cost, distance and flexibility. On one end we have WiFi, which
has a pretty good range, is very easy to use (no wires, after all) and is cheap, but offers comparatively
mediocre bandwidth and latency. No machine learning researcher in their right mind would
use it to build a cluster of servers. In what follows we focus on interconnects that are suitable for
deep learning.


• PCIe is a dedicated bus for very high bandwidth point to point connections (up to 16 Gbit/s
per lane on PCIe 4.0). Latency is in the order of single-digit microseconds (5 us). PCIe links
are precious. Processors only have a limited number of them: AMD's EPYC 3 has 128 lanes,
Intel's Xeon has up to 48 lanes per chip; on desktop grade CPUs the numbers are 20 (Ryzen
9) and 16 (Core i9) respectively. Since GPUs typically have 16 lanes this limits the number of
GPUs that can connect to the CPU at full bandwidth. After all, they need to share the links
with other high bandwidth peripherals such as storage and Ethernet. Just like with RAM
access, large bulk transfers are preferable due to reduced packet overhead.


• Ethernet is the most commonly used way of connecting computers. While it is significantly
slower than PCIe, it is very cheap and resilient to install and covers much longer distances.
Typical bandwidth for low-grade servers is 1 Gbit/s. Higher end devices (e.g., C5 instances¹⁶⁵
in the cloud) offer between 10 and 100 Gbit/s bandwidth. As in all previous cases data
transmission has significant overheads. Note that we almost never use raw Ethernet directly but





163 https://github.com/gunrock/gunrock
164 http://dgl.ai
165 https://aws.amazon.com/ec2/instance-types/c5/ 
rather a protocol that is executed on top of the physical interconnect (such as UDP or TCP/IP).
This adds further overhead. Like PCIe, Ethernet is designed to connect two devices, e.g., a
computer and a switch.


• Switches allow us to connect multiple devices in a manner where any pair of them can
carry out a (typically full bandwidth) point to point connection simultaneously. For
instance, Ethernet switches might connect 40 servers at high cross-sectional bandwidth. Note
that switches are not unique to traditional computer networks. Even PCIe lanes can be
switched¹⁶⁶. This occurs, e.g., to connect a large number of GPUs to a host processor, as
is the case for the P2 instances¹⁶⁷.


• NVLink is an alternative to PCIe when it comes to very high bandwidth interconnects. It
offers up to 300 Gbit/s data transfer rate per link. Server GPUs (Volta V100) have 6 links
whereas consumer grade GPUs (RTX 2080 Ti) have only one link, operating at a reduced 100
Gbit/s rate. We recommend using NCCL¹⁶⁸ to achieve high data transfer between GPUs.
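

The advice to prefer large bulk transfers can be made quantitative with a simple cost model in which every transfer pays a fixed latency plus a term proportional to its size. The sketch below uses the PCIe figures from the list above (5 us latency, 16 lanes at 16 Gbit/s, i.e., 32 GB/s). The model and the function name are our simplification and ignore protocol details.

```python
def transfer_time_us(num_bytes, latency_us, bandwidth_gbyte_s):
    """Estimated time in microseconds for one transfer:
    fixed latency plus size divided by bandwidth.
    1 GB/s corresponds to 1,000 bytes per microsecond."""
    return latency_us + num_bytes / (bandwidth_gbyte_s * 1e3)


LATENCY_US = 5.0       # single-digit microseconds, as quoted for PCIe
BANDWIDTH_GB_S = 32.0  # 16 lanes at 16 Gbit/s each

# Move 1 MiB either as a single transfer or as 1,024 transfers of 1 KiB.
one_large = transfer_time_us(2**20, LATENCY_US, BANDWIDTH_GB_S)
many_small = 1024 * transfer_time_us(2**10, LATENCY_US, BANDWIDTH_GB_S)

print(one_large, many_small, many_small / one_large)
```

In this model the many small transfers are dominated almost entirely by latency, so the same amount of data takes over a hundred times longer to move.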


Summary 


• Devices have overheads for operations. Hence it is important to aim for a small number of
large transfers rather than many small ones. This applies to RAM, SSDs, Networks and GPUs.


• Vectorization is key for performance. Make sure you are aware of the specific abilities of your
accelerator. E.g., some Intel Xeon CPUs are particularly good for INT8 operations, NVIDIA
Volta GPUs excel at FP16 matrix-matrix operations and NVIDIA Turing shines at FP16, INT8
and INT4 operations.


• Numerical overflow due to small datatypes can be a problem during training (and to a lesser
extent during inference).


• Aliasing can significantly degrade performance. For instance, memory alignment on 64 bit
CPUs should be done with respect to 64 bit boundaries. On GPUs it is a good idea to keep
convolution sizes aligned, e.g., to TensorCores.


• Match your algorithms to the hardware (memory footprint, bandwidth, etc.). Great speedup
(orders of magnitude) can be achieved when fitting the parameters into caches.


• We recommend that you sketch out the performance of a novel algorithm on paper before
verifying the experimental results. Discrepancies of an order-of-magnitude or more are
reasons for concern.


• Use profilers to debug performance bottlenecks.


• Training and inference hardware have different sweet spots in terms of price / performance.





166 https://www.broadcom.com/products/pcie-switches-bridges/pcie-switches 
167 https://aws.amazon.com/ec2/instance-types/p2/ 
168 https://github.com/NVIDIA/nccl


12.4.7 More Latency Numbers


The summaries in Table 12.4.1 and Table 12.4.2 are due to Eliot Eshelman¹⁶⁹, who maintains an
updated version of the numbers as a GitHub Gist¹⁷⁰.


Table 12.4.1: Common Latency Numbers.

| Action | Time | Notes |
|---|---|---|
| L1 cache reference/hit | 1.5 ns | 4 cycles |
| Floating-point add/mult/FMA | 1.5 ns | 4 cycles |
| L2 cache reference/hit | 5 ns | 12 ~ 17 cycles |
| Branch mispredict | 6 ns | 15 ~ 20 cycles |
| L3 cache hit (unshared cache) | 16 ns | 42 cycles |
| L3 cache hit (shared in another core) | 25 ns | 65 cycles |
| Mutex lock/unlock | 25 ns | |
| L3 cache hit (modified in another core) | 29 ns | 75 cycles |
| L3 cache hit (on a remote CPU socket) | 40 ns | 100 ~ 300 cycles (40 ~ 116 ns) |
| QPI hop to another CPU (per hop) | 40 ns | |
| 64MB memory ref. (local CPU) | 46 ns | TinyMemBench on Broadwell E5-2690v4 |
| 64MB memory ref. (remote CPU) | 70 ns | TinyMemBench on Broadwell E5-2690v4 |
| 256MB memory ref. (local CPU) | 75 ns | TinyMemBench on Broadwell E5-2690v4 |
| Intel Optane random write | 94 ns | UCSD Non-Volatile Systems Lab |
| 256MB memory ref. (remote CPU) | 120 ns | TinyMemBench on Broadwell E5-2690v4 |
| Intel Optane random read | 305 ns | UCSD Non-Volatile Systems Lab |
| Send 4KB over 100 Gbps HPC fabric | 1 us | MVAPICH2 over Intel Omni-Path |
| Compress 1KB with Google Snappy | 3 us | |
| Send 4KB over 10 Gbps ethernet | 10 us | |
| Write 4KB randomly to NVMe SSD | 30 us | DC P3608 NVMe SSD (QOS 99% is 500us) |
| Transfer 1MB to/from NVLink GPU | 30 us | ~33GB/s on NVIDIA 40GB NVLink |
| Transfer 1MB to/from PCI-E GPU | 80 us | ~12GB/s on PCIe 3.0 x16 link |
| Read 4KB randomly from NVMe SSD | 120 us | DC P3608 NVMe SSD (QOS 99%) |
| Read 1MB sequentially from NVMe SSD | 208 us | ~4.8GB/s DC P3608 NVMe SSD |
| Write 4KB randomly to SATA SSD | 500 us | DC S3510 SATA SSD (QOS 99.9%) |
| Read 4KB randomly from SATA SSD | 500 us | DC S3510 SATA SSD (QOS 99.9%) |
| Round trip within same datacenter | 500 us | One-way ping is ~250us |
| Read 1MB sequentially from SATA SSD | 2 ms | ~550MB/s DC S3510 SATA SSD |
| Read 1MB sequentially from disk | 5 ms | ~200MB/s server HDD |
| Random Disk Access (seek+rotation) | 10 ms | |
| Send packet CA->Netherlands->CA | 150 ms | |

















Table 12.4.2: Latency Numbers for NVIDIA Tesla GPUs.

| Action | Time | Notes |
|---|---|---|
| GPU Shared Memory access | 30 ns | 30~90 cycles (bank conflicts add latency) |
| GPU Global Memory access | 200 ns | 200~800 cycles |
| Launch CUDA kernel on GPU | 10 us | Host CPU instructs GPU to start kernel |
| Transfer 1MB to/from NVLink GPU | 30 us | ~33GB/s on NVIDIA 40GB NVLink |
| Transfer 1MB to/from PCI-E GPU | 80 us | ~12GB/s on PCI-Express x16 link |








169 https://gist.github.com/eshelman
170 https://gist.github.com/eshelman/343a1c46cb3fba142c1afdcdeec17646 


Exercises


1. Write C code to test whether there is any difference in speed between accessing memory
aligned or misaligned relative to the external memory interface. Hint: be careful of caching
effects.

2. Test the difference in speed between accessing memory in sequence or with a given stride.

3. How could you measure the cache sizes on a CPU?

4. How would you lay out data across multiple memory channels for maximum bandwidth?
How would you lay it out if you had many small threads?

5. An enterprise class HDD is spinning at 10,000 rpm. What is the absolute minimum time
an HDD needs to spend worst case before it can read data (you can assume that heads move
almost instantaneously)? Why are 2.5” HDDs becoming popular for commercial servers
(relative to 3.5” and 5.25” drives)?

6. Assume that an HDD manufacturer increases the storage density from 1 Tbit per square inch
to 5 Tbit per square inch. How much information can you store on a ring on a 2.5” HDD? Is
there a difference between the inner and outer tracks?

7. The AWS P2 instances have 16 K80 Kepler GPUs. Use lspci on a p2.16xlarge and a p2.8xlarge
instance to understand how the GPUs are connected to the CPUs. Hint: keep your eye out
for PCI PLX bridges.

8. Going from 8 bit to 16 bit datatypes increases the amount of silicon approximately by 4x.
Why? Why might NVIDIA have added INT4 operations to their Turing GPUs?

9. Given 6 high speed links between GPUs (such as for the Volta V100 GPUs), how would you
connect 8 of them? Look up the connectivity used in the P3.16xlarge servers.

10. How much faster is it to read forward through memory vs. reading backwards? Does this
number differ between different computers and CPU vendors? Why? Write C code and
experiment with it.

11. Can you measure the cache size of your disk? What is it for a typical HDD? Do SSDs need a
cache?

12. Measure the packet overhead when sending messages across the Ethernet. Look up the
difference between UDP and TCP/IP connections.

13. Direct Memory Access allows devices other than the CPU to write (and read) directly to
(from) memory. Why is this a good idea?

14. Look at the performance numbers for the Turing T4 GPU. Why does the performance ‘only’
double as you go from FP16 to INT8 and INT4?

15. What is the shortest time it should take for a packet on a roundtrip between San Francisco
and Amsterdam? Hint: you can assume that the distance is 10,000 km.


Discussions¹⁷¹


171 https://discuss.d2l.ai/t/363


12.5 Training on Multiple GPUs


So far we discussed how to train models efficiently on CPUs and GPUs. We even showed how deep
learning frameworks such as MXNet (and TensorFlow) allow one to parallelize computation and
communication automatically between them in Section 12.3. Lastly, we showed in Section 5.6
how to list all available GPUs on a computer using nvidia-smi. What we did not discuss is how to
actually parallelize deep learning training (we omit any discussion of inference on multiple GPUs
here as it is a rarely used and advanced topic that goes beyond the scope of this book).
Instead, we implied in passing that one would somehow split the data across multiple devices and
make it work. The present section fills in the details and shows how to train a network in parallel
when starting from scratch. Details on how to take advantage of functionality in Gluon are relegated
to Section 12.6. We assume that the reader is familiar with minibatch SGD algorithms such as the
ones described in Section 11.5.


12.5.1 Splitting the Problem 


Let us start with a simple computer vision problem and a slightly archaic network, e.g., with
multiple layers of convolutions, pooling, and possibly a few dense layers in the end. That is, let us
start with a network that looks quite similar to LeNet (LeCun et al., 1998) or AlexNet (Krizhevsky
et al., 2012). Given multiple GPUs (2 if it is a desktop server, 4 on a g4dn.12xlarge, 8 on an AWS
p3.16xlarge, or 16 on a p2.16xlarge), we want to partition training in a manner that achieves good
speedup while simultaneously benefitting from simple and reproducible design choices. Multiple
GPUs, after all, increase both memory and compute ability. In a nutshell, we have a number of
choices, given a minibatch of training data that we want to classify.


• We could partition the network layers across multiple GPUs. That is, each GPU takes as input
the data flowing into a particular layer, processes data across a number of subsequent layers
and then sends the data to the next GPU.


- This allows us to process data with larger networks when compared to what a single 
GPU could handle. 


- Memory footprint per GPU can be well controlled (it is a fraction of the total network 
footprint) 


- The interface between layers (and thus GPUs) requires tight synchronization. This can 
be tricky, in particular if the computational workloads are not properly matched be- 
tween layers. The problem is exacerbated for large numbers of GPUs. 


- The interface between layers requires large amounts of data transfer (activations, gra- 
dients). This may overwhelm the bandwidth of the GPU buses. 


- Compute intensive, yet sequential operations are nontrivial to partition. See e.g., 
(Mirhoseini et al., 2017) for a best effort in this regard. It remains a difficult problem 
and it is unclear whether it is possible to achieve good (linear) scaling on nontrivial 
problems. We do not recommend it unless there is excellent framework / OS support 
for chaining together multiple GPUs. 


• We could split the work required by individual layers. For instance, rather than computing
64 channels on a single GPU we could split up the problem across 4 GPUs, each of which
generates data for 16 channels. Likewise, for a dense layer we could split the number of
output neurons. Fig. 12.5.1 illustrates this design. The figure is taken from (Krizhevsky et al.,
2012) where this strategy was used to deal with GPUs that had a very small memory footprint
(2GB at the time).


- This allows for good scaling in terms of computation, provided that the number of chan- 
nels (or neurons) is not too small. 


- Multiple GPUs can process increasingly larger networks since the memory available 
scales linearly. 


- We need a very large number of synchronization / barrier operations since each layer 
depends on the results from all other layers. 


- The amount of data that needs to be transferred is potentially even larger than when
distributing layers across GPUs. We do not recommend this approach due to its bandwidth
cost and complexity.


Fig. 12.5.1: Model parallelism in the original AlexNet design due to limited GPU memory.


• Lastly, we could partition data across multiple GPUs. This way all GPUs perform the same
type of work, albeit on different observations. Gradients are aggregated between GPUs after 
each minibatch. 


- This is the simplest approach and it can be applied in any situation. 


- Adding more GPUs does not allow us to train larger models. 


- We only need to synchronize after each minibatch. That said, it is highly desirable to
start exchanging gradients for some parameters while others are still being computed.


- Large numbers of GPUs lead to very large minibatch sizes, thus reducing training efficiency.


By and large, data parallelism is the most convenient way to proceed, provided that we have ac- 
cess to GPUs with sufficiently large memory. See also (Li et al., 2014) for a detailed description of 
partitioning for distributed training. GPU memory used to be a problem in the early days of deep 
learning. By now this issue has been resolved for all but the most unusual cases. We focus on data 
parallelism in what follows. 







12.5.2 Data Parallelism 


Assume that there are k GPUs on a machine. Given the model to be trained, each GPU will maintain 
a complete set of model parameters independently. Training proceeds as follows (see Fig. 12.5.2 
for details on data parallel training on two GPUs). 










Fig. 12.5.2: Calculation of minibatch stochastic gradient using data parallelism and two GPUs.


1. In any iteration of training, given a random minibatch, we split the examples in the batch
into k portions and distribute them evenly across the GPUs.


2. Each GPU calculates loss and gradient of the model parameters based on the minibatch subset
it was assigned and the model parameters it maintains.


3. The local gradients of each of the k GPUs are aggregated to obtain the current minibatch
stochastic gradient.


4. The aggregate gradient is re-distributed to each GPU.


5. Each GPU uses this minibatch stochastic gradient to update the complete set of model parameters
that it maintains.
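
The steps above can be sketched in plain Python without any GPU. This is only a minimal illustration under stated assumptions, not the implementation used later in this section: it assumes a one-parameter model y_hat = w * x with squared loss, a batch size divisible by the number of replicas, and the hypothetical helper names `grad_sum` and `data_parallel_step`.

```python
# Minimal sketch of one data-parallel SGD step for the model y_hat = w * x
# with squared loss 0.5 * (w * x - y) ** 2. All names are illustrative.

def grad_sum(w, xs, ys):
    """Sum of the per-example gradients of the squared loss w.r.t. w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))

def data_parallel_step(replicas, xs, ys, lr):
    """replicas[j] is the copy of the parameter w kept by device j."""
    k = len(replicas)
    shard = len(xs) // k  # assumes the batch size is divisible by k
    # Step 1: split the minibatch into k equal shards
    shards = [(xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
              for i in range(k)]
    # Step 2: each replica computes a local gradient on its own shard
    local = [grad_sum(w, sx, sy) for w, (sx, sy) in zip(replicas, shards)]
    # Steps 3 and 4: aggregate the local gradients and share the sum
    total = sum(local)
    # Step 5: every replica applies the same update, normalized by the
    # size of the full minibatch
    return [w - lr * total / len(xs) for w in replicas]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
replicas = data_parallel_step([0.0, 0.0], xs, ys, lr=0.1)
# Reference: the same update computed on a single device
single = 0.0 - 0.1 * grad_sum(0.0, xs, ys) / len(xs)
print(replicas, single)
```

Because the shard gradients simply add up to the full-batch gradient, both replicas end up with the same parameter value as single-device minibatch SGD, which is why the replicas stay synchronized from step to step.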


Fig. 12.5.3: Parallelization on multiple GPUs. From left to right: original problem, network partitioning, layer partitioning, data parallelism.


A comparison of different ways of parallelization on multiple GPUs is depicted in Fig. 12.5.3. Note
that in practice we increase the minibatch size k-fold when training on k GPUs such that each GPU
has the same amount of work to do as if we were training on a single GPU only. On a 16 GPU server
this can increase the minibatch size considerably and we may have to increase the learning rate
accordingly. Also note that batch normalization (Section 7.5) needs to be adjusted (e.g., by keeping
a separate batch norm coefficient per GPU). In what follows we will use LeNet (Section 6.6) as the
toy network to illustrate multi-GPU training. As always we begin by importing the relevant packages
and modules.


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, gluon, np, npx

npx.set_np()


12.5.3 A Toy Network 


We use LeNet as introduced in Section 6.6. We define it from scratch to illustrate parameter ex- 
change and synchronization in detail. 


# Initialize model parameters
scale = 0.01
W1 = np.random.normal(scale=scale, size=(20, 1, 3, 3))
b1 = np.zeros(20)
W2 = np.random.normal(scale=scale, size=(50, 20, 5, 5))
b2 = np.zeros(50)
W3 = np.random.normal(scale=scale, size=(800, 128))
b3 = np.zeros(128)
W4 = np.random.normal(scale=scale, size=(128, 10))
b4 = np.zeros(10)
params = [W1, b1, W2, b2, W3, b3, W4, b4]

# Define the model
def lenet(X, params):
    h1_conv = npx.convolution(data=X, weight=params[0], bias=params[1],
                              kernel=(3, 3), num_filter=20)
    h1_activation = npx.relu(h1_conv)
    h1 = npx.pooling(data=h1_activation, pool_type='avg', kernel=(2, 2),
                     stride=(2, 2))
    h2_conv = npx.convolution(data=h1, weight=params[2], bias=params[3],
                              kernel=(5, 5), num_filter=50)
    h2_activation = npx.relu(h2_conv)
    h2 = npx.pooling(data=h2_activation, pool_type='avg', kernel=(2, 2),
                     stride=(2, 2))
    h2 = h2.reshape(h2.shape[0], -1)
    h3_linear = np.dot(h2, params[4]) + params[5]
    h3 = npx.relu(h3_linear)
    y_hat = np.dot(h3, params[6]) + params[7]
    return y_hat

# Cross-entropy loss function
loss = gluon.loss.SoftmaxCrossEntropyLoss()


12.5.4 Data Synchronization 


For efficient multi-GPU training we need two basic operations: firstly we need to have the ability 
to distribute a list of parameters to multiple devices and to attach gradients (get_params). Without 
parameters it is impossible to evaluate the network on a GPU. Secondly, we need the ability to sum 
parameters across multiple devices, i.e., we need an allreduce function. 


def get_params(params, device):
    new_params = [p.copyto(device) for p in params]
    for p in new_params:
        p.attach_grad()
    return new_params


Let us try it out by copying the model parameters of lenet to gpu(0). 


new_params = get_params(params, d2l.try_gpu(0))
print('b1 weight:', new_params[1])
print('b1 grad:', new_params[1].grad)


b1 weight: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] @gpu(0)
b1 grad: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] @gpu(0)


Since we didn't perform any computation yet, the gradient with regard to the bias weights is still 0. 
Now let us assume that we have a vector distributed across multiple GPUs. The following allre- 
duce function adds up all vectors and broadcasts the result back to all GPUs. Note that for this to 
work we need to copy the data to the device accumulating the results. 


def allreduce(data):
    for i in range(1, len(data)):
        data[0][:] += data[i].copyto(data[0].ctx)
    for i in range(1, len(data)):
        data[0].copyto(data[i])


Let us test this by creating vectors with different values on different devices and aggregating them.


data = [np.ones((1, 2), ctx=d2l.try_gpu(i)) * (i + 1) for i in range(2)]
print('before allreduce:\n', data[0], '\n', data[1])
allreduce(data)
print('after allreduce:\n', data[0], '\n', data[1])


before allreduce:
[[1. 1.]] @gpu(0)
[[2. 2.]] @gpu(1)
after allreduce:
[[3. 3.]] @gpu(0)
[[3. 3.]] @gpu(1) 
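
The same two-phase pattern (reduce onto the first buffer, then broadcast) can be tried without GPUs. The following is a rough sketch that uses plain Python lists as stand-ins for device arrays; `allreduce_lists` is a hypothetical name, and no device-to-device copies are modeled.

```python
def allreduce_lists(buffers):
    """Sum all buffers into buffers[0], then copy the result to the others."""
    # Reduce: accumulate every other buffer into the first one, in place
    for other in buffers[1:]:
        for j, value in enumerate(other):
            buffers[0][j] += value
    # Broadcast: overwrite every other buffer with the aggregated values
    for other in buffers[1:]:
        other[:] = buffers[0]

buffers = [[1.0, 1.0], [2.0, 2.0]]
allreduce_lists(buffers)
print(buffers)  # [[3.0, 3.0], [3.0, 3.0]]
```

As in the GPU version, all traffic passes through the first buffer, which is what makes this simple scheme a bottleneck once the number of devices grows.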


12.5.5 Distributing Data 


We need a simple utility function to distribute a minibatch evenly across multiple GPUs. For instance,
on 2 GPUs we would like half of the data to be copied to each of the GPUs. Since it is more convenient
and more concise, we use the built-in split_and_load function in Gluon and try it out on a 4 x 5 matrix.


data = np.arange(20).reshape(4, 5)
devices = [npx.gpu(0), npx.gpu(1)]
split = gluon.utils.split_and_load(data, devices)
print('input :', data)
print('load into', devices)
print('output:', split)


input : [[ 0.  1.  2.  3.  4.]
 [ 5.  6.  7.  8.  9.]
 [10. 11. 12. 13. 14.]
 [15. 16. 17. 18. 19.]]
load into [gpu(0), gpu(1)]
output: [array([[0., 1., 2., 3., 4.],
       [5., 6., 7., 8., 9.]], ctx=gpu(0)), array([[10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.]], ctx=gpu(1))]
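
To make the splitting itself concrete, here is a small device-free sketch. It only mimics the slicing along the batch axis; the function name `split_evenly` is hypothetical, and raising an error for batches that do not divide evenly is an assumption of this sketch rather than a statement about Gluon's behavior.

```python
def split_evenly(rows, num_devices):
    """Split a list of rows into num_devices contiguous, equally sized shards."""
    if len(rows) % num_devices != 0:
        raise ValueError('batch size must be divisible by the number of devices')
    step = len(rows) // num_devices
    return [rows[i * step:(i + 1) * step] for i in range(num_devices)]

rows = [[5 * r + c for c in range(5)] for r in range(4)]  # a 4 x 5 matrix
shards = split_evenly(rows, 2)
print(shards[0])  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(shards[1])  # [[10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]
```

The real utility additionally copies each shard to its target device, which this sketch leaves out.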


For later reuse we define a split_batch function which splits both data and labels. 


#@save
def split_batch(X, y, devices):
    """Split `X` and `y` into multiple devices."""
    assert X.shape[0] == y.shape[0]
    return (gluon.utils.split_and_load(X, devices),
            gluon.utils.split_and_load(y, devices))


12.5.6 Training


Now we can implement multi-GPU training on a single minibatch. Its implementation is primarily 
based on the data parallelism approach described in this section. We will use the auxiliary func- 
tions we just discussed, allreduce and split_and_load, to synchronize the data among multiple 
GPUs. Note that we do not need to write any specific code to achieve parallelism. Since the computational
graph does not have any dependencies across devices within a minibatch, it is executed
in parallel automatically.


def train_batch(X, y, device_params, devices, lr):
    X_shards, y_shards = split_batch(X, y, devices)
    with autograd.record():  # Loss is calculated separately on each GPU
        losses = [loss(lenet(X_shard, device_W), y_shard)
                  for X_shard, y_shard, device_W in zip(
                      X_shards, y_shards, device_params)]
    for l in losses:  # Backpropagation is performed separately on each GPU
        l.backward()
    # Sum all gradients from each GPU and broadcast them to all GPUs
    for i in range(len(device_params[0])):
        allreduce([device_params[c][i].grad for c in range(len(devices))])
    # The model parameters are updated separately on each GPU
    for param in device_params:
        d2l.sgd(param, lr, X.shape[0])  # Here, we use a full-size batch


Now, we can define the training function. It is slightly different from the ones used in the previous 
chapters: we need to allocate the GPUs and copy all the model parameters to all devices. Obvi- 
ously each batch is processed using train_batch to deal with multiple GPUs. For convenience 
(and conciseness of code) we compute the accuracy on a single GPU (this is inefficient since the 
other GPUs are idle). 


def train(num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    devices = [d2l.try_gpu(i) for i in range(num_gpus)]
    # Copy model parameters to num_gpus GPUs
    device_params = [get_params(params, d) for d in devices]
    num_epochs = 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    timer = d2l.Timer()
    for epoch in range(num_epochs):
        timer.start()
        for X, y in train_iter:
            # Perform multi-GPU training for a single minibatch
            train_batch(X, y, device_params, devices, lr)
            npx.waitall()
        timer.stop()
        # Verify the model on GPU 0
        animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(
            lambda x: lenet(x, device_params[0]), test_iter, devices[0]),))
    print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
          f'on {str(devices)}')


12.5.7 Experiment


Let us see how well this works on a single GPU. We use a batch size of 256 and a learning rate of 
0.2. 


train(num_gpus=1, batch_size=256, lr=0.2)


test acc: 0.85, 3.1 sec/epoch on [gpu(0)]


By keeping the batch size and learning rate unchanged and changing the number of GPUs to 2,
we can see that the test accuracy is roughly the same as in the previous experiment. In terms of
the optimization algorithm, the two runs are identical. Unfortunately there is no meaningful speedup
to be gained here: the model is simply too small; moreover we only have a small dataset, where our
slightly unsophisticated approach to implementing multi-GPU training suffers from significant Python
overhead. We will encounter more complex models and more sophisticated ways of parallelization going
forward. Let us see what happens nonetheless for Fashion-MNIST.


train(num_gpus=2, batch_size=256, lr=0.2)


test acc: 0.83, 5.8 sec/epoch on [gpu(0), gpu(1)]


Summary


• There are multiple ways to split deep network training over multiple GPUs. We could split
it between layers, across layers, or across data. The former two require tightly choreographed
data transfers. Data parallelism is the simplest strategy.


• Data parallel training is straightforward. However, it needs a larger effective minibatch size
to be efficient.


• Data is split across multiple GPUs, each GPU executes its own forward and backward operation,
and subsequently gradients are aggregated and results broadcast back to the GPUs.


• Large minibatches may require a slightly increased learning rate.


Exercises 


1. When training on multiple GPUs, change the minibatch size from b to k · b, i.e., scale it up by
the number of GPUs.


2. Compare accuracy for different learning rates. How does it scale with the number of GPUs?


3. Implement a more efficient allreduce that aggregates different parameters on different GPUs.
Why is this more efficient in the first place?


4. Implement multi-GPU test accuracy computation. 


Discussions


12.6 Concise Implementation for Multiple GPUs 


Implementing parallelism from scratch for every new model is no fun. Moreover, there is signif- 
icant benefit in optimizing synchronization tools for high performance. In the following we will 
show how to do this using Gluon. The math and the algorithms are the same as in Section 12.5. As 
before we begin by importing the required modules (quite unsurprisingly you will need at least 
two GPUs to run this notebook). 


from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn

npx.set_np()


12.6.1 A Toy Network


Let us use a slightly more meaningful network than LeNet from the previous section that’s still 
sufficiently easy and quick to train. We pick a ResNet-18 variant (He et al., 2016a). Since the input 
images are tiny we modify it slightly. In particular, the difference from Section 7.6 is that we use a
smaller convolution kernel, stride, and padding at the beginning. Moreover, we remove the max- 
pooling layer. 


#@save
def resnet18(num_classes):
    """A slightly modified ResNet-18 model."""
    def resnet_block(num_channels, num_residuals, first_block=False):
        blk = nn.Sequential()
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.add(d2l.Residual(
                    num_channels, use_1x1conv=True, strides=2))
            else:
                blk.add(d2l.Residual(num_channels))
        return blk

    net = nn.Sequential()
    # This model uses a smaller convolution kernel, stride, and padding and
    # removes the maximum pooling layer
    net.add(nn.Conv2D(64, kernel_size=3, strides=1, padding=1),
            nn.BatchNorm(), nn.Activation('relu'))
    net.add(resnet_block(64, 2, first_block=True),
            resnet_block(128, 2),
            resnet_block(256, 2),
            resnet_block(512, 2))
    net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
    return net


12.6.2 Parameter Initialization and Logistics 


The initialize method allows us to set initial defaults for parameters on a device of our choice. 
For a refresher see Section 4.8. What is particularly convenient is that it also lets us initialize the 
network on multiple devices simultaneously. Let us see how this works in practice.


net = resnet18(10)
# Get a list of GPUs
devices = d2l.try_all_gpus()
# Initialize the network on all of them
net.initialize(init=init.Normal(sigma=0.01), ctx=devices)


Using the split_and_load function introduced in the previous section we can divide a minibatch 
of data and copy portions to the list of devices provided by the context variable. The network 
object automatically uses the appropriate GPU to compute the value of the forward propagation. 
As before we generate 4 observations and split them over the GPUs. 


x = np.random.uniform(size=(4, 1, 28, 28))
x_shards = gluon.utils.split_and_load(x, devices)
net(x_shards[0]), net(x_shards[1])


(array([[ 2.2610193e-06,  2.2045974e-06, -5.4046782e-06,  1.2869954e-06,
          5.1373149e-06, -3.8298003e-06,  1.4339014e-07,  5.4683451e-06,
         -2.8279194e-06, -3.9651113e-06],
        [ 2.0698667e-06,  2.0084665e-06, -5.6382501e-06,  1.0498469e-06,
          5.5506416e-06, -4.1065468e-06,  6.0830143e-07,  5.4521765e-06,
         -3.7365030e-06, -4.1891640e-06]], ctx=gpu(0)),
 array([[ 2.4629794e-06,  2.6015521e-06, -5.4362622e-06,  1.2938231e-06,
          5.6387898e-06, -4.1360104e-06,  3.5758922e-07,  5.5125238e-06,
         -3.1957329e-06, -4.2976321e-06],
        [ 1.9431675e-06,  2.2600425e-06, -5.2698206e-06,  1.4807410e-06,
          5.4830930e-06, -3.9678889e-06,  7.5752268e-08,  5.6764361e-06,
         -3.2530238e-06, -4.0943960e-06]], ctx=gpu(1)))


Once data passes through the network, the corresponding parameters are initialized on the device 
the data passed through. This means that initialization happens on a per-device basis. Since we 
picked GPU 0 and GPU 1 for initialization, the network is initialized only there, and not on the 
CPU. In fact, the parameters do not even exist on the CPU. We can verify this by printing out the
parameters and observing any errors that might arise. 


weight = net[0].params.get('weight')

try:
    weight.data()
except RuntimeError:
    print('not initialized on cpu')
weight.data(devices[0])[0], weight.data(devices[1])[0]


not initialized on cpu


(array([[[ 0.01382882, -0.01183044,  0.01417866],
         [-0.00319718,  0.00439528,  0.02562625],
         [-0.00835081,  0.01387452, -0.01035946]]], ctx=gpu(0)),
 array([[[ 0.01382882, -0.01183044,  0.01417866],
         [-0.00319718,  0.00439528,  0.02562625],
         [-0.00835081,  0.01387452, -0.01035946]]], ctx=gpu(1)))


Lastly let us replace the code to evaluate the accuracy by one that works in parallel across multiple 
devices. This serves as a replacement of the evaluate_accuracy_gpu function from Section 6.6. 
The main difference is that we split a batch before invoking the network. All else is essentially 
identical. 


#@save
def evaluate_accuracy_gpus(net, data_iter, split_f=d2l.split_batch):
    # Query the list of devices
    devices = list(net.collect_params().values())[0].list_ctx()
    metric = d2l.Accumulator(2)  # num_corrected_examples, num_examples
    for features, labels in data_iter:
        X_shards, y_shards = split_f(features, labels, devices)
        # Run in parallel
        pred_shards = [net(X_shard) for X_shard in X_shards]
        metric.add(sum(float(d2l.accuracy(pred_shard, y_shard)) for
                       pred_shard, y_shard in zip(
                           pred_shards, y_shards)), labels.size)
    return metric[0] / metric[1]


12.6.3 Training 


As before, the training code needs to perform a number of basic functions for efficient parallelism:

• Network parameters need to be initialized across all devices.
• While iterating over the dataset, minibatches are to be divided across all devices.
• We compute the loss and its gradient in parallel across devices.
• Gradients are aggregated (by the trainer) and parameters are updated accordingly.


In the end we compute the accuracy (again in parallel) to report the final value of the network. 
The training routine is quite similar to implementations in previous chapters, except that we need 
to split and aggregate data. 


def train(num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    ctx = [d2l.try_gpu(i) for i in range(num_gpus)]
    net.initialize(init=init.Normal(sigma=0.01), ctx=ctx, force_reinit=True)
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': lr})
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    timer, num_epochs = d2l.Timer(), 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    for epoch in range(num_epochs):
        timer.start()
        for features, labels in train_iter:
            X_shards, y_shards = d2l.split_batch(features, labels, ctx)
            with autograd.record():
                losses = [loss(net(X_shard), y_shard) for X_shard, y_shard
                          in zip(X_shards, y_shards)]
            for l in losses:
                l.backward()
            trainer.step(batch_size)
        npx.waitall()
        timer.stop()
        animator.add(epoch + 1, (evaluate_accuracy_gpus(net, test_iter),))
    print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
          f'on {str(ctx)}')


12.6.4 Experiments
Let us see how this works in practice. As a warmup we train the network on a single GPU. 


train(num_gpus=1, batch_size=256, lr=0.1)


test acc: 0.93, 13.2 sec/epoch on [gpu(0)]


Next we use 2 GPUs for training. Compared to LeNet the model for ResNet-18 is considerably
more complex. This is where parallelization shows its advantage. The time for computation is 
meaningfully larger than the time for synchronizing parameters. This improves scalability since 
the overhead for parallelization is less relevant. 


train(num_gpus=2, batch_size=512, lr=0.2)


test acc: 0.92, 6.9 sec/epoch on [gpu(0), gpu(1)] 
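
To put the reported timings into perspective, one can compute the speedup and the parallel efficiency (speedup divided by the number of GPUs). The snippet below is a small sketch using only the per-epoch times printed in this chapter; the helper name `speedup_and_efficiency` is hypothetical. Keep in mind that the two ResNet-18 runs also differ in batch size and learning rate, so this is a rough comparison of time per pass over the data rather than a controlled benchmark.

```python
def speedup_and_efficiency(t_single, t_multi, num_gpus):
    """Return (speedup, parallel efficiency) from per-epoch times in seconds."""
    speedup = t_single / t_multi
    return speedup, speedup / num_gpus

# Per-epoch times reported above: ResNet-18 (this section) and LeNet
# (previous section), each on one GPU and on two GPUs
resnet_stats = speedup_and_efficiency(13.2, 6.9, 2)
lenet_stats = speedup_and_efficiency(3.1, 5.8, 2)
print(f'ResNet-18: {resnet_stats[0]:.2f}x speedup, '
      f'{resnet_stats[1]:.0%} efficiency')
print(f'LeNet:     {lenet_stats[0]:.2f}x speedup, '
      f'{lenet_stats[1]:.0%} efficiency')
```

For ResNet-18 the speedup is roughly 1.9x on two GPUs, whereas the from-scratch LeNet run actually became slower, which matches the discussion of synchronization overhead versus compute time.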







Summary 


• Gluon provides primitives for model initialization across multiple devices by providing a
context list.


• Data is automatically evaluated on the devices where the data can be found.


• Take care to initialize the networks on each device before trying to access the parameters on
that device. Otherwise you will encounter an error.


• The optimization algorithms automatically aggregate over multiple GPUs.


Exercises 


1. This section uses ResNet-18. Try different epochs, batch sizes, and learning rates. Use more 
GPUs for computation. What happens if you try this on a p2.16xlarge instance with 16 GPUs? 


2. Sometimes, different devices provide different computing power. We could use the GPUs 
and the CPU at the same time. How should we divide the work? Is it worth the effort? Why? 
Why not? 


3. What happens if we drop npx.waitall()? How would you modify training such that you have 
an overlap of up to two steps for parallelism? 


Discussions


12.7 Parameter Servers 


As we move from single GPUs to multiple GPUs and then to multiple servers containing multiple
GPUs, possibly all spread out across multiple racks and network switches, our algorithms for
distributed and parallel training need to become much more sophisticated. Details matter since
different interconnects have very different bandwidth (e.g., NVLink can offer up to 100GB/s across
6 links in an appropriate setting, PCIe 3.0 16x lanes offer 16GB/s, while even high-speed 100 GbE
only amounts to 10GB/s). At the same time it is unreasonable to expect that a statistical
modeler be an expert in networking and systems. 


The core idea of the parameter server was introduced in (Smola & Narayanamurthy, 2010) in the 
context of distributed latent variable models. A description of the push and pull semantics then 
followed in (Ahmed et al., 2012) and a description of the system and an open source library followed
in (Li et al., 2014). In the following we will motivate the components needed for efficiency.


12.7.1 Data Parallel Training


Let us review the data parallel training approach to distributed training. We will use this to the 
exclusion of all others in this section since it is significantly simpler to implement in practice. 
There are virtually no use cases (besides deep learning on graphs) where any other strategy for 
parallelism is preferred since GPUs have plenty of memory nowadays. Fig. 12.7.1 describes the 
variant of data parallelism that we implemented in the previous section. The key aspect in it is 
that the aggregation of gradients occurs on GPU 0 before the updated parameters are rebroadcast
to all GPUs. 


Fig. 12.7.1: Left: single GPU training; Right: a variant of multi-GPU training. It proceeds as follows:
(1) we compute loss and gradient, (2) all gradients are aggregated on one GPU, (3) the parameter
update happens and the parameters are re-distributed to all GPUs.


In retrospect, the decision to aggregate on GPU 0 seems rather ad-hoc. After all, we might just as
well aggregate on the CPU. In fact, we could even decide to aggregate some of the parameters on
one GPU and some others on another. Provided that the optimization algorithm supports this,
there is no real reason why we could not. For instance, if we have four parameter vectors
$\mathbf{v}_1, \ldots, \mathbf{v}_4$ with associated gradients $\mathbf{g}_1, \ldots, \mathbf{g}_4$ we could aggregate the gradients on one GPU each.

$$\mathbf{g}_i = \sum_{j \in \textrm{GPUs}} \mathbf{g}_{ij} \tag{12.7.1}$$
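
As a rough illustration of (12.7.1), the sketch below sums each parameter's gradient over all GPUs and records which GPU would perform that sum. The round-robin assignment of parameter i to GPU i mod k and the name `sharded_aggregate` are assumptions made for this example; gradients are plain numbers rather than tensors.

```python
def sharded_aggregate(local_grads):
    """local_grads[j][i] is the gradient of parameter i computed on GPU j.

    Returns one (owner_gpu, aggregated_gradient) pair per parameter, where
    parameter i is summed on GPU i % num_gpus.
    """
    num_gpus = len(local_grads)
    num_params = len(local_grads[0])
    result = []
    for i in range(num_params):
        owner = i % num_gpus  # round-robin choice; an assumption of this sketch
        g_i = sum(local_grads[j][i] for j in range(num_gpus))  # Eq. (12.7.1)
        result.append((owner, g_i))
    return result

local_grads = [[1, 2, 3, 4],              # gradients computed on GPU 0
               [10, 20, 30, 40],          # GPU 1
               [100, 200, 300, 400],      # GPU 2
               [1000, 2000, 3000, 4000]]  # GPU 3
aggregated = sharded_aggregate(local_grads)
print(aggregated)  # [(0, 1111), (1, 2222), (2, 3333), (3, 4444)]
```

Each GPU ends up responsible for one of the four sums, so the aggregation work and the associated traffic are spread over all devices instead of being concentrated on a single one.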

This reasoning seems arbitrary and frivolous. After all, the math is the same throughout. However,
we are dealing with real physical hardware where different buses have different bandwidth
as discussed in Section 12.4. Consider a real 4-way GPU server as described in Fig. 12.7.2. If it
is particularly well connected, it might have a 100 GbE network card. More typical numbers are
in the 1-10 GbE range with an effective bandwidth of 100MB/s to 1GB/s. Since the CPUs have too
few PCIe lanes to connect to all GPUs directly (e.g., consumer grade Intel CPUs have 24 lanes) we
need a multiplexer. The bandwidth from the CPU on a 16x Gen3 link is 16GB/s. This is also the
speed at which each of the GPUs is connected to the switch. This means that it is more effective to
communicate directly between the GPUs.










Fig. 12.7.2: A 4-way GPU server. The network switch is connected at 10-100 GbE (1-10 GB/s), the
CPU at PCIe 3.0 16x (16 GB/s), and the four GPUs via a PCIe switch at 4x PCIe 3.0 16x (64 GB/s).


For the sake of the argument let us assume that the gradients 'weigh' 160MB. In this case it takes
30ms to send the gradients from all 3 remaining GPUs to the fourth one (each transfer takes 10ms
= 160MB / 16 GB/s). Adding another 30ms to transmit the weight vectors back, we arrive at a total of
60ms. If we send all data to the CPU we incur a penalty of 40ms since each of the four GPUs needs 
to send the data to the CPU, yielding a total of 80ms. Lastly assume that we are able to split the 
gradients into 4 parts of 40MB each. Now we can aggregate each of the parts on a different GPU 
simultaneously since the PCIe switch offers a full-bandwidth operation between all links. Instead 
of 30ms this takes 7.5ms, yielding a total of 15ms for a synchronization operation. In short, de- 
pending on how we synchronize parameters the same operation can take anywhere from 15ms to 
80ms. Fig. 12.7.3 depicts the different strategies for exchanging parameters. 
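
The arithmetic behind these three numbers can be reproduced in a few lines. This is a back-of-the-envelope sketch that follows the simplifications of the text (transfers to a single target happen one after another, 1 GB = 1000 MB, latency is ignored); the variable and function names are illustrative.

```python
GRADIENT_MB = 160    # size of all gradients, as assumed in the text
LINK_GB_PER_S = 16   # PCIe 3.0 16x bandwidth per link
NUM_GPUS = 4

def transfer_ms(size_mb, bandwidth_gb_per_s):
    """Time in milliseconds to move size_mb over one link."""
    return size_mb / bandwidth_gb_per_s  # MB / (GB/s) = ms

# Strategy 1: aggregate on one GPU; 3 sequential sends in, 3 sends back
one_gpu = 2 * (NUM_GPUS - 1) * transfer_ms(GRADIENT_MB, LINK_GB_PER_S)
# Strategy 2: aggregate on the CPU; 4 sequential sends in, 4 sends back
cpu = 2 * NUM_GPUS * transfer_ms(GRADIENT_MB, LINK_GB_PER_S)
# Strategy 3: each GPU aggregates one quarter of the gradients in parallel
chunk_mb = GRADIENT_MB / NUM_GPUS
split = 2 * (NUM_GPUS - 1) * transfer_ms(chunk_mb, LINK_GB_PER_S)
print(one_gpu, cpu, split)  # 60.0 80.0 15.0
```

Changing `GRADIENT_MB` or `LINK_GB_PER_S` shows how quickly the gap between the strategies grows for larger models or slower links.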

Fig. 12.7.3: Synchronization strategies.


Note that we have yet another tool at our disposal when it comes to improving performance: in a
deep network it takes some time to compute all gradients from the top to the bottom. We can begin
synchronizing gradients for some parameter groups even while we are still busy computing them
for others (the technical details for that are somewhat involved). See e.g., (Sergeev & DelBalso,
2018) for details on how to do this in Horovod.


12.7.2 Ring Synchronization 


When it comes to synchronization on modern deep learning hardware we often encounter sig- 
nificantly bespoke network connectivity. For instance, the AWS P3.16xlarge and NVIDIA DGX-2 
instances share the connectivity structure of Fig. 12.7.4. Each GPU connects to a host CPU via a 
PCIe link which operates at best at 16 GB/s. Additionally each GPU also has 6 NVLink connections, 
each of which is capable of transferring 300 Gbit/s bidirectionally. This amounts to around 18 GB/s 
per link per direction. In short, the aggregate NVLink bandwidth is significantly higher than the 
PCIe bandwidth. The question is how to use it most efficiently.


Fig. 12.7.4: NVLink connectivity on 8-GPU V100 servers (image courtesy of NVIDIA).


It turns out (Wang et al., 2018) that the optimal synchronization strategy is to decompose the net- 
work into two rings and to use them to synchronize data directly. Fig. 12.7.5 illustrates that the 
network can be decomposed into one ring (1-2-3-4-5-6-7-8-1) with double NVLink bandwidth and 
into one (1-4-6-3-5-8-2-7-1) with regular bandwidth. Designing an efficient synchronization protocol
in this case is nontrivial.


Fig. 12.7.5: Decomposition of the NVLink network into two rings.


Consider the following thought experiment: given a ring of n compute nodes (or GPUs) we can
send gradients from the first to the second node. There it is added to the local gradient and sent
on to the third node, and so on. After n - 1 steps the aggregate gradient can be found in the
last-visited node. That is, the time to aggregate gradients grows linearly with the number of nodes.
But if we do this the algorithm is quite inefficient. After all, at any time there is only one of the
nodes communicating. What if we broke the gradients into n chunks and started synchronizing
chunk i starting at node i? Since each chunk is of size 1/n the total time is now (n - 1)/n ≈ 1. In
other words, the time spent to aggregate gradients does not grow as we increase the size of the ring.
This is quite an astonishing result. Fig. 12.7.6 illustrates the sequence of steps on n = 4 nodes.


Fig. 12.7.6: Ring synchronization across 4 nodes. Each node starts transmitting parts of gradients
to its left neighbor until the assembled gradient can be found in its right neighbor. 
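
The chunked scheme can be simulated in a few lines of Python. The sketch below is an illustration under stated assumptions rather than a description of any particular library: each node holds n scalar chunks, nodes send to neighbor (i + 1) mod n (the direction is arbitrary), and a second pass that redistributes the finished chunks is added so that every node ends up with the full result. The name `ring_allreduce` is hypothetical.

```python
def ring_allreduce(data):
    """data[i][c] is chunk c held by node i; every node has n chunks.

    After the call every node holds the chunk-wise sum over all nodes.
    Returns the number of communication steps taken.
    """
    n = len(data)
    steps = 0
    # Phase 1 (aggregate): in step s, node i sends chunk (i - s) % n to its
    # neighbor (i + 1) % n, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, data[i][(i - s) % n]) for i in range(n)]
        for i, c, value in sends:
            data[(i + 1) % n][c] += value
        steps += 1
    # Node i now holds the complete sum for chunk (i + 1) % n.
    # Phase 2 (distribute): pass the completed chunks once around the ring.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, data[i][(i + 1 - s) % n])
                 for i in range(n)]
        for i, c, value in sends:
            data[(i + 1) % n][c] = value
        steps += 1
    return steps

# Node i starts with the chunks [(i + 1) * 1, (i + 1) * 10, ...]
data = [[(i + 1) * 10 ** c for c in range(4)] for i in range(4)]
steps = ring_allreduce(data)
print(steps, data[0])  # 6 [10, 100, 1000, 10000]
```

All n links are busy in every step and each message carries only 1/n of the gradient, so each phase costs (n - 1)/n of the time needed to send the full gradient over one link.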


If we use the same example of synchronizing 160MB across 8 V100 GPUs we arrive at approximately 
2 · 160MB / (3 · 18GB/s) ≈ 6ms. This is quite a bit better than using the PCIe bus, even though we are
now using 8 GPUs. Note that in practice these numbers are quite a bit worse, since deep learning 
frameworks often fail to assemble communication into large burst transfers. Moreover, timing is 
critical. Note that there is a common misconception that ring synchronization is fundamentally 
different from other synchronization algorithms. The only difference is that the synchronization 
path is somewhat more elaborate when compared to a simple tree. 
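
The estimate of roughly 6ms can be checked directly. In the sketch below the factor 3 is read as the combined capacity of the two rings (one with double bandwidth, one with regular bandwidth) and the factor 2 as one pass to aggregate plus one pass to distribute the result; both readings are interpretations of the formula above, and the variable names are illustrative.

```python
size_mb = 160        # gradient size from the running example
link_gb_per_s = 18   # approximate NVLink bandwidth per link per direction
parallel_links = 3   # one double-bandwidth ring plus one regular ring
passes = 2           # one pass to aggregate, one to distribute the result

# MB / (GB/s) = ms when 1 GB is taken as 1000 MB
ring_ms = passes * size_mb / (parallel_links * link_gb_per_s)
print(f'{ring_ms:.1f} ms')  # 5.9 ms
```

Compared with the 15ms to 80ms obtained for the 4-GPU PCIe server, this is a substantial improvement even though twice as many GPUs take part.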


12.7.3 Multi-Machine Training 


Distributed training on multiple machines adds a further challenge: we need to communicate
with servers that are only connected across a comparatively lower-bandwidth fabric, which can be
over an order of magnitude slower in some cases. Synchronization across devices is tricky. After
all, different machines running training code will have subtly different speeds. Hence we need to
synchronize them if we want to use synchronous distributed optimization. Fig. 12.7.7 illustrates
how distributed parallel training occurs.


1. A (different) batch of data is read on each machine, split across multiple GPUs and trans- 
ferred to GPU memory. There predictions and gradients are computed on each GPU batch 
separately. 


2. The gradients from all local GPUs are aggregated on one GPU (or alternatively parts of them are
aggregated over different GPUs).


3. The gradients are sent to the CPU. 


4. The CPU sends the gradients to a central parameter server which aggregates all the gradi- 
ents. 


5. The aggregate gradients are then used to update the weight vectors and the updated weight
vectors are broadcast back to the individual CPUs.


6. The information is sent to one (or multiple) GPUs. 


7. The updated weight vectors are spread across all GPUs. 


Fig. 12.7.7: Multi-machine multi-GPU distributed parallel training.


Each of these operations seems rather straightforward. And, indeed, they can be carried out ef- 
ficiently within a single machine. Once we look at multiple machines, though, we can see that 
the central parameter server becomes the bottleneck. After all, the bandwidth per server is lim- 
ited, hence for m workers the time it takes to send all gradients to the server is O(m). We can 
break through this barrier by increasing the number of servers to n. At this point each server only 
needs to store O(1/n) of the parameters, hence the total time for updates and optimization be- 
comes O(m/n). Matching both numbers yields constant scaling regardless of how many workers 
we are dealing with. In practice we use the same machines both as workers and as servers. Fig. 
12.7.8 illustrates the design. See also (Li et al., 2014) for details. In particular, ensuring that mul- 
tiple machines work without unreasonable delays is nontrivial. We omit details on barriers and 
will only briefly touch on synchronous and asynchronous updates below. 
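

The scaling argument above can be checked with a little arithmetic. The following sketch assumes that parameters are sharded evenly over the servers and that each server's network link is the only bottleneck. The function name sync_time and the numbers used (160MB of gradients per worker, 1GB/s per server link) are illustrative assumptions, not measurements.

```python
def sync_time(num_workers, num_servers, grad_bytes, bandwidth):
    """Seconds needed for all workers to send their gradients to the servers.

    Assumes even sharding: each server receives 1/num_servers of every
    worker's gradient over a link with the given bandwidth (bytes/second).
    """
    bytes_per_server = num_workers * grad_bytes / num_servers
    return bytes_per_server / bandwidth


# Hypothetical numbers: 160MB of gradients per worker, 1GB/s per server link.
for workers, servers in [(8, 1), (8, 8), (16, 16)]:
    print(workers, servers, sync_time(workers, servers, 160e6, 1e9))
```

With a single server the time grows with the number of workers m. When the number of servers n grows along with m, the ratio m/n and hence the time stay constant.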


Fig. 12.7.8: Top - a single parameter server is a bottleneck since its bandwidth is finite. Bottom - 
multiple parameter servers store parts of the parameters with aggregate bandwidth. 


12.7.4 (key,value) Stores 


Implementing the steps required for distributed multi-GPU training in practice is nontrivial,
in particular given the many different choices that we might encounter. This is why it pays to use a
common abstraction, namely that of a (key,value) store with redefined update semantics. Across
many servers and many GPUs the gradient computation can be defined as


$$\mathbf{g}_i = \sum_{k \in \mathrm{workers}} \sum_{j \in \mathrm{GPUs}} \mathbf{g}_{ijk} \qquad (12.7.2)$$


The key aspect in this operation is that it is a commutative reduction, that is, it turns many vectors
into one and the order in which the operation is applied does not matter. This is great for our
purposes since we do not (need to) have fine-grained control over when which gradient is received.
Note that it is possible for us to perform the reduction stagewise. Furthermore, note that this
operation is independent between blocks i pertaining to different parameters (and gradients).


This allows us to define the following two operations: push, which accumulates gradients, and 
pull, which retrieves aggregate gradients. Since we have many different sets of gradients (after 
all, we have many layers), we need to index the gradients with a key i. This similarity to (key,value) 
stores, such as the one introduced in Dynamo (DeCandia et al., 2007), is no coincidence. They,
too, share many similar characteristics, in particular when it comes to distributing the parameters
across multiple servers.


• push(key, value) sends a particular gradient (the value) from a worker to a common storage.
There the parameter is aggregated, e.g., by summing it up.


• pull(key, value) retrieves an aggregate parameter from common storage, e.g., after combining
the gradients from all workers.


By hiding all the complexity about synchronization behind a simple push and pull operation we 
can decouple the concerns of the statistical modeler who wants to be able to express optimization 
in simple terms and the systems engineer who needs to deal with the complexity inherent in dis- 
tributed synchronization. In the next section we will experiment with such a (key,value) store in 
practice. 
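

To make the push and pull semantics concrete, here is a minimal in-memory sketch. The class name ToyKVStore is hypothetical, values are plain Python lists, aggregation is an elementwise sum, and pull returns the value rather than writing into an output argument. A real distributed store additionally handles sharding across servers, networking, and synchronization, none of which is modeled here.

```python
class ToyKVStore:
    """Single-process stand-in for a (key,value) store with summing updates."""

    def __init__(self):
        self._store = {}

    def push(self, key, value):
        # Aggregate by elementwise summation; the first push creates the entry.
        if key in self._store:
            self._store[key] = [a + b for a, b in zip(self._store[key], value)]
        else:
            self._store[key] = list(value)

    def pull(self, key):
        # Return a copy of the aggregate so callers cannot modify the store.
        return list(self._store[key])


kv = ToyKVStore()
kv.push("layer1", [1, 2])    # gradient from worker 0
kv.push("layer1", [10, 20])  # gradient from worker 1
print(kv.pull("layer1"))
```

Because the sum is a commutative reduction, the aggregate does not depend on the order in which the workers push their gradients.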


Summary 


• Synchronization needs to be highly adaptive to specific network infrastructure and connectivity
within a server. This can make a significant difference to the time it takes to synchronize.


• Ring synchronization can be optimal for P3 and DGX-2 servers. For other servers it may not be.


• A hierarchical synchronization strategy works well when adding multiple parameter servers
for increased bandwidth.


• Asynchronous communication (while computation is still ongoing) can improve performance.


Exercises 


1. Can you increase the ring synchronization even further? Hint: you can send messages in
both directions.


2. Fully asynchronous. Some delays permitted?


3. Fault tolerance. How? What if we lose a server? Is this a problem?


4. Checkpointing


5. Tree aggregation. Can you do it faster?


6. Other reductions (commutative semiring).


Discussions


13 Computer Vision


Many applications in the area of computer vision are closely related to our daily lives, now and in 
the future, whether medical diagnostics, driverless vehicles, camera monitoring, or smart filters. 
In recent years, deep learning technology has greatly enhanced computer vision systems' perfor- 
mance. It can be said that the most advanced computer vision applications are nearly inseparable 
from deep learning. 


We have introduced deep learning models commonly used in the area of computer vision in the 
chapter “Convolutional Neural Networks” and have practiced simple image classification tasks. In 
this chapter, we will introduce image augmentation and fine tuning methods and apply them to 
image classification. Then, we will explore various methods of object detection. After that, we 
will learn how to use fully convolutional networks to perform semantic segmentation on images. 
Then, we explain how to use style transfer technology to generate images that look like the cover of 
this book. Finally, we will perform practice exercises on two important computer vision datasets 
to review the content of this chapter and the previous chapters. 


13.1 Image Augmentation 


We mentioned that large-scale datasets are prerequisites for the successful application of deep 
neural networks in Section 7.1. Image augmentation technology expands the scale of training 
datasets by making a series of random changes to the training images to produce similar, but dif- 
ferent, training examples. Another way to explain image augmentation is that randomly changing 
training examples can reduce a model's dependence on certain properties, thereby improving its 
capability for generalization. For example, we can crop the images in different ways, so that the 
objects of interest appear in different positions, reducing the model's dependence on the position
where objects appear. We can also adjust the brightness, color, and other factors to reduce
the model's sensitivity to color. It can be said that image augmentation technology contributed greatly
to the success of AlexNet. In this section, we will discuss this technology, which is widely used in 
computer vision. 


First, import the packages or modules required for the experiment in this section. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, gluon, image, init, np, npx
from mxnet.gluon import nn

npx.set_np()


13.1.1 Common Image Augmentation Method


In this experiment, we will use an image with a shape of 400 x 500 as an example. 


d2l.set_figsize()
img = image.imread('../img/cat1.jpg')
d2l.plt.imshow(img.asnumpy());


Most image augmentation methods have a certain degree of randomness. To make it easier for
us to observe the effect of image augmentation, we next define the auxiliary function apply. This 
function runs the image augmentation method aug multiple times on the input image img and 
shows all results. 


def apply(img, aug, num_rows=2, num_cols=4, scale=1.5):
    # Run the augmentation several times and show all results in a grid
    Y = [aug(img) for _ in range(num_rows * num_cols)]
    d2l.show_images(Y, num_rows, num_cols, scale=scale)


Flipping and Cropping 


Flipping the image left and right usually does not change the category of the object. This is one 
of the earliest and most widely used methods of image augmentation. Next, we use the trans- 
forms module to create the RandomFlipLeftRight instance, which introduces a 50% chance that 
the image is flipped left and right. 


apply(img, gluon.data.vision.transforms.RandomFlipLeftRight()) 
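

What a random left-right flip does can be sketched without any framework. The function below is a hypothetical stand-in, not Gluon's implementation: it treats an image as a list of pixel rows and reverses every row with probability p.

```python
import random


def random_flip_left_right(img, p=0.5, rng=random):
    """Flip an image (a list of pixel rows) left to right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in img]
    return [list(row) for row in img]


img = [[1, 2, 3],
       [4, 5, 6]]
print(random_flip_left_right(img, p=1.0))  # always flipped
print(random_flip_left_right(img, p=0.0))  # never flipped
```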


Flipping up and down is not as commonly used as flipping left and right. However, at least for this
example image, flipping up and down does not hinder recognition. Next, we create a RandomFlipTopBottom
instance for a 50% chance of flipping the image up and down.


apply(img, gluon.data.vision.transforms.RandomFlipTopBottom()) 





In the example image we used, the cat is in the middle of the image, but this may not be the case 
for all images. In Section 6.5, we explained that the pooling layer can reduce the sensitivity of the 
convolutional layer to the target location. In addition, we can make objects appear at different 
positions in the image in different proportions by randomly cropping the image. This can also 
reduce the sensitivity of the model to the target position. 


In the following code, we randomly crop a region with an area of 10% to 100% of the original 
area, and the ratio of width to height of the region is randomly selected from between 0.5 and 2. 
Then, the width and height of the region are both scaled to 200 pixels. Unless otherwise stated, the
random number between a and b in this section refers to a continuous value obtained by uniform
sampling in the interval [a, b].


shape_aug = gluon.data.vision.transforms.RandomResizedCrop(
    (200, 200), scale=(0.1, 1), ratio=(0.5, 2))
apply(img, shape_aug)
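

The sampling rule described above (an area fraction between 10% and 100% and a width-to-height ratio between 0.5 and 2) can be sketched as follows. This is an illustration of the description in the text, not Gluon's actual code: the function name sample_crop_box, the retry loop, and the fallback to the full image are assumptions, and the final resize to 200 x 200 pixels is omitted.

```python
import math
import random


def sample_crop_box(height, width, scale=(0.1, 1.0), ratio=(0.5, 2.0),
                    rng=random, max_tries=10):
    """Return (x0, y0, w, h) of a randomly sized and placed crop region."""
    area = height * width
    for _ in range(max_tries):
        target_area = rng.uniform(*scale) * area
        aspect = rng.uniform(*ratio)  # width divided by height
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            x0 = rng.randint(0, width - w)
            y0 = rng.randint(0, height - h)
            return x0, y0, w, h
    # Assumed fallback: use the whole image if no valid box was found.
    return 0, 0, width, height


rng = random.Random(0)
print(sample_crop_box(400, 500, rng=rng))
```

Boxes that do not fit inside the image are rejected and resampled, so a returned region always lies within the image bounds.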


Changing the Color


Another augmentation method is changing colors. We can change four aspects of the image color:
brightness, contrast, saturation, and hue. In the example below, we randomly change the brightness
of the image to a value between 50% (1 - 0.5) and 150% (1 + 0.5) of the original image.


apply(img, gluon.data.vision.transforms.RandomBrightness(0.5)) 
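

The brightness change can be sketched as a multiplication by a random factor drawn from [1 - 0.5, 1 + 0.5]. The function below is a hypothetical illustration on a grayscale image stored as a list of rows. Clipping the result to the range [0, 255] is an assumption made for this sketch and may differ from what Gluon does internally.

```python
import random


def random_brightness(img, brightness=0.5, rng=random):
    """Scale all pixel values by a factor drawn from [1 - b, 1 + b]."""
    alpha = 1.0 + rng.uniform(-brightness, brightness)
    # Assumed behavior: keep the result inside the valid pixel range.
    return [[min(255.0, max(0.0, v * alpha)) for v in row] for row in img]


img = [[0.0, 100.0],
       [200.0, 255.0]]
print(random_brightness(img, 0.5, random.Random(1)))
```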








Similarly, we can randomly change the hue of the image. 


apply(img, gluon.data.vision.transforms.RandomHue(0.5)) 


We can also create a RandomColorJitter instance and set how to randomly change the brightness,
contrast, saturation, and hue of the image at the same time. 


color_aug = gluon.data.vision.transforms.RandomColorJitter(
    brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5)
apply(img, color_aug)





Overlaying Multiple Image Augmentation Methods


In practice, we will overlay multiple image augmentation methods. We can overlay the different 
image augmentation methods defined above and apply them to each image by using a Compose 
instance. 


augs = gluon.data.vision.transforms.Compose([
    gluon.data.vision.transforms.RandomFlipLeftRight(), color_aug, shape_aug])
apply(img, augs)


13.1.2 Using an Image Augmentation Training Model


Next, we will look at how to apply image augmentation in actual training. Here, we use the CIFAR-10
dataset, instead of the Fashion-MNIST dataset we have been using. This is because the position
and size of the objects in the Fashion-MNIST dataset have been normalized, whereas the differences in
color and size of the objects in the CIFAR-10 dataset are more significant. The first 32 training images
in the CIFAR-10 dataset are shown below.


d2l.show_images(gluon.data.vision.CIFAR10(
    train=True)[0:32][0], 4, 8, scale=0.8);











In order to obtain definitive results during prediction, we usually only apply image augmentation
to the training examples, and do not use image augmentation with random operations during
prediction. Here, we only use the simplest random left-right flipping method. In addition, we use
a ToTensor instance to convert minibatch images into the format required by MXNet, i.e., 32-bit
floating point numbers with the shape of (batch size, number of channels, height, width) and
value range between 0 and 1.


train_augs = gluon.data.vision.transforms.Compose([
    gluon.data.vision.transforms.RandomFlipLeftRight(),
    gluon.data.vision.transforms.ToTensor()])

test_augs = gluon.data.vision.transforms.Compose([
    gluon.data.vision.transforms.ToTensor()])
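

The format conversion performed by ToTensor can be illustrated for a single image in plain Python. The helper to_tensor below is a hypothetical sketch, not the Gluon class: it assumes the input is a nested list in (height, width, channels) layout with integer values from 0 to 255, and it returns a (channels, height, width) layout with values between 0 and 1.

```python
def to_tensor(img_hwc):
    """Convert an HWC image with values 0..255 into CHW floats in [0, 1]."""
    height = len(img_hwc)
    width = len(img_hwc[0])
    channels = len(img_hwc[0][0])
    return [[[img_hwc[y][x][c] / 255.0 for x in range(width)]
             for y in range(height)]
            for c in range(channels)]


# A 2 x 2 RGB image
img = [[[0, 128, 255], [255, 0, 0]],
       [[0, 255, 0], [0, 0, 255]]]
tensor = to_tensor(img)
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

Stacking such per-image results along a new leading axis gives the (batch size, number of channels, height, width) shape mentioned above.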


Next, we define an auxiliary function to make it easier to read the image and apply image augmentation.
The transform_first function provided by Gluon's dataset applies image augmentation to
the first element of each training example (image and label), i.e., the image.
For detailed descriptions of DataLoader, refer to Section 3.5.


def load_cifar10(is_train, augs, batch_size):
    return gluon.data.DataLoader(
        gluon.data.vision.CIFAR10(train=is_train).transform_first(augs),
        batch_size=batch_size, shuffle=is_train,
        num_workers=d2l.get_dataloader_workers())


Using a Multi-GPU Training Model 


We train the ResNet-18 model described in Section 7.6 on the CIFAR-10 dataset. We will also apply 
the methods described in Section 12.6 and use a multi-GPU training model. 


Next, we define the training function to train and evaluate the model using multiple GPUs. 


#@save
def train_batch_ch13(net, features, labels, loss, trainer, devices,
                     split_f=d2l.split_batch):
    X_shards, y_shards = split_f(features, labels, devices)
    with autograd.record():
        pred_shards = [net(X_shard) for X_shard in X_shards]
        ls = [loss(pred_shard, y_shard) for pred_shard, y_shard
              in zip(pred_shards, y_shards)]
    for l in ls:
        l.backward()
    # The True flag allows parameters with stale gradients, which is useful
    # later (e.g., in fine-tuning BERT)
    trainer.step(labels.shape[0], ignore_stale_grad=True)
    train_loss_sum = sum([float(l.sum()) for l in ls])
    train_acc_sum = sum(d2l.accuracy(pred_shard, y_shard)
                        for pred_shard, y_shard in zip(pred_shards, y_shards))
    return train_loss_sum, train_acc_sum


#@save
def train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs,
               devices=d2l.try_all_gpus(), split_f=d2l.split_batch):
    timer, num_batches = d2l.Timer(), len(train_iter)
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0, 1],
                            legend=['train loss', 'train acc', 'test acc'])
    for epoch in range(num_epochs):
        # Store training_loss, training_accuracy, num_examples, num_features
        metric = d2l.Accumulator(4)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            l, acc = train_batch_ch13(
                net, features, labels, loss, trainer, devices, split_f)
            metric.add(l, acc, labels.shape[0], labels.size)
            timer.stop()
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[2], metric[1] / metric[3],
                              None))
        test_acc = d2l.evaluate_accuracy_gpus(net, test_iter, split_f)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {metric[0] / metric[2]:.3f}, train acc '
          f'{metric[1] / metric[3]:.3f}, test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec on '
          f'{str(devices)}')


Now, we can define the train_with_data_aug function to use image augmentation to train the 
model. This function obtains all available GPUs and uses Adam as the optimization algorithm 
for training. It then applies image augmentation to the training dataset, and finally calls the 
train_ch13 function just defined to train and evaluate the model. 


batch_size, devices, net = 256, d2l.try_all_gpus(), d2l.resnet18(10)
net.initialize(init=init.Xavier(), ctx=devices)

def train_with_data_aug(train_augs, test_augs, net, lr=0.001):
    train_iter = load_cifar10(True, train_augs, batch_size)
    test_iter = load_cifar10(False, test_augs, batch_size)
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': lr})
    train_ch13(net, train_iter, test_iter, loss, trainer, 10, devices)


Now we train the model using image augmentation of random flipping left and right. 


train_with_data_aug(train_augs, test_augs, net) 


loss 0.171, train acc 0.941, test acc 0.830 
4575.9 examples/sec on [gpu(0), gpu(1)] 


[Figure: train loss, train acc, and test acc plotted against epoch.]





Summary 
• Image augmentation generates random images based on existing training data to cope with
overfitting.


• In order to obtain definitive results during prediction, we usually only apply image augmentation
to the training examples, and do not use image augmentation with random operations
during prediction.


• We can obtain classes related to image augmentation from Gluon's transforms module.


Exercises 


1. Train the model without using image augmentation: train_with_data_aug(no_aug, 
no_aug). Compare training and testing accuracy when using and not using image augmenta- 
tion. Can this comparative experiment support the argument that image augmentation can 
mitigate overfitting? Why? 


2. Add different image augmentation methods in model training based on the CIFAR-10 
dataset. Observe the implementation results. 


3. With reference to the MXNet documentation, what other image augmentation methods are 
provided in Gluon’s transforms module? 


Discussions: https://discuss.d2l.ai/t/367


13.2 Fine-Tuning


In earlier chapters, we discussed how to train models on the Fashion-MNIST training dataset, 
which only has 60,000 images. We also described ImageNet, the most widely used large-scale 
image dataset in the academic world, with more than 10 million images and objects of over 1000 
categories. However, the size of datasets that we often deal with is usually larger than the first, but 
smaller than the second. 


Assume we want to identify different kinds of chairs in images and then push the purchase link 
to the user. One possible method is to first find a hundred common chairs, take one thousand 
different images with different angles for each chair, and then train a classification model on the 
collected image dataset. Although this dataset may be larger than Fashion-MNIST, the number of 
examples is still less than one tenth of ImageNet. This may result in the overfitting of the com- 
plicated model applicable to ImageNet on this dataset. At the same time, because of the limited 
amount of data, the accuracy of the final trained model may not meet the practical requirements. 


In order to deal with the above problems, an obvious solution is to collect more data. However, 
collecting and labeling data can consume a lot of time and money. For example, in order to collect 
the ImageNet dataset, researchers spent millions of dollars of research funding. Although,
recently, data collection costs have dropped significantly, the costs still cannot be ignored. 


Another solution is to apply transfer learning to migrate the knowledge learned from the source 
dataset to the target dataset. For example, although the images in ImageNet are mostly unrelated 
to chairs, models trained on this dataset can extract more general image features that can help 
identify edges, textures, shapes, and object composition. These similar features may be equally 
effective for recognizing a chair. 


In this section, we introduce a common technique in transfer learning: fine tuning. As shown in 
Fig. 13.2.1, fine tuning consists of the following four steps: 


1. Pre-train a neural network model, i.e., the source model, on a source dataset (e.g., the Ima- 
geNet dataset). 


2. Create a new neural network model, i.e., the target model. This replicates all model designs 
and their parameters on the source model, except the output layer. We assume that these 
model parameters contain the knowledge learned from the source dataset and this knowl- 
edge will be equally applicable to the target dataset. We also assume that the output layer 
of the source model is closely related to the labels of the source dataset and is therefore not 
used in the target model. 


3. Add an output layer whose output size is the number of target dataset categories to the target 
model, and randomly initialize the model parameters of this layer. 


4. Train the target model on a target dataset, such as a chair dataset. We will train the output 
layer from scratch, while the parameters of all remaining layers are fine-tuned based on the 
parameters of the source model. 


Fig. 13.2.1: Fine tuning.
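

Steps 2 and 3 of the procedure above can be sketched with plain dictionaries standing in for model parameters. Everything in this example is hypothetical: the function make_target_params, the parameter names, and the small Gaussian initialization are assumptions chosen for illustration, and real models hold tensors rather than lists.

```python
import random


def make_target_params(source_params, output_prefix, feature_dim,
                       num_classes, rng=random):
    """Copy all source parameters except the output layer, then add a new
    randomly initialized output layer sized for the target dataset."""
    # Step 2: replicate everything except the source output layer.
    target = {name: list(values) for name, values in source_params.items()
              if not name.startswith(output_prefix)}
    # Step 3: new output layer with one row of weights per target class.
    target[output_prefix + "weight"] = [
        rng.gauss(0.0, 0.01) for _ in range(num_classes * feature_dim)]
    target[output_prefix + "bias"] = [0.0] * num_classes
    return target


source = {"features.conv.weight": [0.1, 0.2],
          "output.weight": [0.0] * 12,   # 3 source classes x 4 features
          "output.bias": [0.0] * 3}
target = make_target_params(source, "output.", feature_dim=4, num_classes=2,
                            rng=random.Random(0))
print(sorted(target))
```

Step 4 then trains all entries of the target dictionary on the target dataset, starting from the copied values for the feature layers and from the random values for the output layer.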





13.2.1 Hot Dog Recognition 


Next, we will use a specific example for practice: hot dog recognition. We will fine-tune the ResNet 
model trained on the ImageNet dataset based on a small dataset. This small dataset contains thou- 
sands of images, some of which contain hot dogs. We will use the model obtained by fine tuning 
to identify whether an image contains a hot dog. 


First, import the packages and modules required for the experiment. Gluon's model_zoo package
provides common pre-trained models. If you want to get more pre-trained models for computer
vision, you can use the GluonCV Toolkit (https://gluon-cv.mxnet.io).


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon, init, np, npx
from mxnet.gluon import nn
import os

npx.set_np()


Obtaining the Dataset 


The hot dog dataset we use was taken from online images and contains 1,400 positive images
containing hot dogs and the same number of negative images containing other foods. 1,000 images
of various classes are used for training and the rest are used for testing.


We first download the compressed dataset and get two folders hotdog/train and hotdog/test. 
Both folders have hotdog and not-hotdog category subfolders, each of which has corresponding 
image files. 





8 https://gluon-cv.mxnet.io 


#@save
d2l.DATA_HUB['hotdog'] = (d2l.DATA_URL + 'hotdog.zip',
                          'fba480ffa8aa7e0febbb511d181409f899b9baa5')

data_dir = d2l.download_extract('hotdog')


Downloading ../data/hotdog.zip from http://d2l-data.s3-accelerate.amazonaws.com/hotdog.zip...


We create two ImageFolderDataset instances to read all the image files in the training dataset and 
testing dataset, respectively. 


train_imgs = gluon.data.vision.ImageFolderDataset(
    os.path.join(data_dir, 'train'))
test_imgs = gluon.data.vision.ImageFolderDataset(
    os.path.join(data_dir, 'test'))


The first 8 positive examples and the last 8 negative images are shown below. As you can see, the 
images vary in size and aspect ratio. 


hotdogs = [train_imgs[i][0] for i in range(8)]
not_hotdogs = [train_imgs[-i - 1][0] for i in range(8)]
d2l.show_images(hotdogs + not_hotdogs, 2, 8, scale=1.4);





During training, we first crop a random area with random size and random aspect ratio from the 
image and then scale the area to an input with a height and width of 224 pixels. During testing, 
we scale the height and width of images to 256 pixels, and then crop the center area with height 
and width of 224 pixels to use as the input. In addition, we normalize the values of the three RGB 
(red, green, and blue) color channels. The average of all values of the channel is subtracted from 
each value and then the result is divided by the standard deviation of all values of the channel to 
produce the output. 


# We specify the mean and standard deviation of the three RGB channels to
# normalize the image channels
normalize = gluon.data.vision.transforms.Normalize(
    [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

train_augs = gluon.data.vision.transforms.Compose([
    gluon.data.vision.transforms.RandomResizedCrop(224),
    gluon.data.vision.transforms.RandomFlipLeftRight(),
    gluon.data.vision.transforms.ToTensor(),
    normalize])

test_augs = gluon.data.vision.transforms.Compose([
    gluon.data.vision.transforms.Resize(256),
    gluon.data.vision.transforms.CenterCrop(224),
    gluon.data.vision.transforms.ToTensor(),
    normalize])
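

The per-channel normalization described above amounts to subtracting a mean and dividing by a standard deviation for each color channel. The sketch below shows this on a nested list in (channels, height, width) layout. The function name normalize_channels is hypothetical and the tiny one-row image is made up for illustration. The means and standard deviations are the ones passed to Normalize above.

```python
def normalize_channels(img_chw, mean, std):
    """Apply (value - mean) / std separately to each channel of a CHW image."""
    return [[[(v - m) / s for v in row] for row in channel]
            for channel, m, s in zip(img_chw, mean, std)]


mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
# One row with two pixels per channel; values are assumed to lie in [0, 1].
img = [[[0.485, 0.714]],
       [[0.456, 0.680]],
       [[0.406, 0.631]]]
print(normalize_channels(img, mean, std))
```

A pixel equal to the channel mean maps to 0, and a pixel one standard deviation above the mean maps to approximately 1.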


Defining and Initializing the Model 


We use ResNet-18, which was pre-trained on the ImageNet dataset, as the source model. Here, we 
specify pretrained=True to automatically download and load the pre-trained model parameters. 
The first time they are used, the model parameters need to be downloaded from the Internet. 


pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True) 


The pre-trained source model instance contains two member variables: features and output. 
The former contains all layers of the model, except the output layer, and the latter is the output 
layer of the model. The main purpose of this division is to facilitate the fine tuning of the model 
parameters of all layers except the output layer. The member variable output of the source model is
given below. As a fully connected layer, it transforms ResNet's final global average pooling layer
output into 1000 class outputs on the ImageNet dataset.


pretrained_net.output 


Dense(512 -> 1000, linear) 


We then build a new neural network to use as the target model. It is defined in the same way as the 
pre-trained source model, but the final number of outputs is equal to the number of categories in 
the target dataset. In the code below, the model parameters in the member variable features of 
the target model instance finetune_net are initialized to model parameters of the corresponding
layer of the source model. Because the model parameters in features are obtained by pre-training
on the ImageNet dataset, they are good enough. Therefore, we generally only need to use small
learning rates to "fine-tune" these parameters. In contrast, model parameters in the member variable
output are randomly initialized and generally require a larger learning rate to learn from scratch.
Assume the learning rate in the Trainer instance is η and use a learning rate of 10η to update the
model parameters in the member variable output.


finetune_net = gluon.model_zoo.vision.resnet18_v2(classes=2)
finetune_net.features = pretrained_net.features
finetune_net.output.initialize(init.Xavier())
# The model parameters in output will be updated using a learning rate ten
# times greater
finetune_net.output.collect_params().setattr('lr_mult', 10)
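

The effect of a per-parameter learning rate multiplier can be illustrated with a plain SGD update. This is a simplified sketch, not the Trainer's code: the function sgd_step and the parameter names are hypothetical, parameters are single numbers, and weight decay is ignored.

```python
def sgd_step(params, grads, lr, lr_mult):
    """One SGD update where each parameter uses lr times its own multiplier."""
    return {name: params[name] - lr * lr_mult.get(name, 1.0) * grads[name]
            for name in params}


params = {"features.weight": 1.0, "output.weight": 1.0}
grads = {"features.weight": 0.5, "output.weight": 0.5}
# Only the output layer gets the tenfold learning rate.
updated = sgd_step(params, grads, lr=0.01, lr_mult={"output.weight": 10})
print(updated)
```

With identical gradients, the output layer parameter moves ten times as far as the pre-trained feature parameter.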


Fine Tuning the Model


We first define a training function train_fine_tuning that uses fine tuning so it can be called 
multiple times. 


def train_fine_tuning(net, learning_rate, batch_size=128, num_epochs=5):
    train_iter = gluon.data.DataLoader(
        train_imgs.transform_first(train_augs), batch_size, shuffle=True)
    test_iter = gluon.data.DataLoader(
        test_imgs.transform_first(test_augs), batch_size)
    devices = d2l.try_all_gpus()
    net.collect_params().reset_ctx(devices)
    net.hybridize()
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {
        'learning_rate': learning_rate, 'wd': 0.001})
    d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs,
                   devices)


We set the learning rate in the Trainer instance to a smaller value, such as 0.01, in order to 
fine-tune the model parameters obtained in pretraining. Based on the previous settings, we will 
train the output layer parameters of the target model from scratch using a learning rate ten times 
greater. 


train_fine_tuning(finetune_net, 0.01) 


loss 0.459, train acc 0.881, test acc 0.940 
462.9 examples/sec on [gpu(0), gpu(1)] 


[Figure: train loss, train acc, and test acc plotted against epoch.]


For comparison, we define an identical model, but initialize all of its model parameters to random 
values. Since the entire model needs to be trained from scratch, we can use a larger learning rate. 


scratch_net = gluon.model_zoo.vision.resnet18_v2(classes=2) 
scratch_net.initialize(init=init.Xavier()) 
train_fine_tuning(scratch_net, 0.1) 


loss 0.386, train acc 0.826, test acc 0.765
491.6 examples/sec on [gpu(0), gpu(1)] 


[Figure: train loss, train acc, and test acc plotted against epoch.]


As you can see, the fine-tuned model tends to achieve higher accuracy after the same number of
epochs because the initial values of the parameters are better.


Summary 
• Transfer learning migrates the knowledge learned from the source dataset to the target
dataset. Fine tuning is a common technique for transfer learning.


• The target model replicates all model designs and their parameters on the source model,
except the output layer, and fine-tunes these parameters based on the target dataset. In
contrast, the output layer of the target model needs to be trained from scratch.


• Generally, fine tuning parameters use a smaller learning rate, while training the output layer
from scratch can use a larger learning rate.


Exercises 
1. Keep increasing the learning rate of finetune_net. How does the precision of the model 
change? 


2. Further tune the hyperparameters of finetune_net and scratch_net in the comparative ex- 
periment. Do they still have different precisions? 


3. Set the parameters in finetune_net.features to the parameters of the source model and do
not update them during training. What will happen? You can use the following code.


finetune_net.features.collect_params().setattr('grad_req', 'null')


4. In fact, there is also a “hotdog” class in the ImageNet dataset. Its corresponding weight pa- 
rameter at the output layer can be obtained by using the following code. How can we use 
this parameter? 


weight = pretrained_net.output.weight
hotdog_w = np.split(weight.data(), 1000, axis=0)[713]
hotdog_w.shape


(1, 512)


Discussions


13.3 Object Detection and Bounding Boxes 


In the previous section, we introduced many models for image classification. In image classifica- 
tion tasks, we assume that there is only one main target in the image and we only focus on how to 
identify the target category. However, in many situations, there are multiple targets in the image 
that we are interested in. We not only want to classify them, but also want to obtain their specific 
positions in the image. In computer vision, we refer to such tasks as object detection (or object 
recognition). 


Object detection is widely used in many fields. For example, in self-driving technology, we need 
to plan routes by identifying the locations of vehicles, pedestrians, roads, and obstacles in the 
captured video image. Robots often perform this type of task to detect targets of interest. Systems 
in the security field need to detect abnormal targets, such as intruders or bombs. 


In the next few sections, we will introduce multiple deep learning models used for object detec- 
tion. Before that, we should discuss the concept of target location. First, import the packages and 
modules required for the experiment. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import image, npx, np

npx.set_np()


Next, we will load the sample images that will be used in this section. We can see there is a dog 
on the left side of the image and a cat on the right. They are the two main targets in this image. 


d2l.set_figsize()
img = image.imread('../img/catdog.jpg').asnumpy()
d2l.plt.imshow(img);





12 https://discuss.d2l.ai/t/368


13.3.1 Bounding Box


In object detection, we usually use a bounding box to describe the target location. The bounding
box is a rectangular box that can be determined by the x and y axis coordinates of the upper-left
corner and the x and y axis coordinates of the lower-right corner of the rectangle. Another
commonly used bounding box representation is the x and y axis coordinates of the bounding box
center, together with its width and height. Here we define functions to convert between these two
representations: box_corner_to_center converts from the two-corner representation to the
center-width-height representation, and box_center_to_corner does the reverse. Because both
functions index their input as boxes[:, 0], the argument boxes should be a two-dimensional
tensor of shape (N, 4); a single box can be passed with shape (1, 4).


#@save
def box_corner_to_center(boxes):
    """Convert from (upper_left, bottom_right) to (center, width, height)"""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = np.stack((cx, cy, w, h), axis=-1)
    return boxes

#@save
def box_center_to_corner(boxes):
    """Convert from (center, width, height) to (upper_left, bottom_right)"""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    x2 = cx + 0.5 * w
    y2 = cy + 0.5 * h
    boxes = np.stack((x1, y1, x2, y2), axis=-1)
    return boxes


We will define the bounding boxes of the dog and the cat in the image based on the coordinate 
information. The origin of the coordinates in the image is the upper left corner of the image, and 
to the right and down are the positive directions of the x axis and the y axis, respectively. 


# bbox is the abbreviation for bounding box
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]

We can verify the correctness of box conversion functions by converting twice.


boxes = np.array((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) - boxes


array([[0., 0., 0., 0.],
       [0., 0., 0., 0.]])
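The conversion functions above expect an (N, 4) input. As a minimal sketch, assuming plain NumPy rather than the MXNet np interface, the variants below use ellipsis indexing so that a single length-4 box and an (N, 4) batch are both accepted. The names corner_to_center and center_to_corner are hypothetical and are not part of the d2l package.

```python
import numpy as np

def corner_to_center(boxes):
    """(x1, y1, x2, y2) -> (cx, cy, w, h) for inputs of shape (4,) or (N, 4)."""
    boxes = np.asarray(boxes, dtype=float)
    x1, y1, x2, y2 = (boxes[..., k] for k in range(4))
    return np.stack(((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1), axis=-1)

def center_to_corner(boxes):
    """(cx, cy, w, h) -> (x1, y1, x2, y2) for inputs of shape (4,) or (N, 4)."""
    boxes = np.asarray(boxes, dtype=float)
    cx, cy, w, h = (boxes[..., k] for k in range(4))
    return np.stack((cx - 0.5 * w, cy - 0.5 * h,
                     cx + 0.5 * w, cy + 0.5 * h), axis=-1)

dog = [60.0, 45.0, 378.0, 516.0]
cat = [400.0, 112.0, 655.0, 493.0]
single = corner_to_center(dog)        # center (219, 280.5), width 318, height 471
batch = corner_to_center([dog, cat])  # shape (2, 4)
restored = center_to_corner(batch)    # converts back to the corner format
```

Indexing the last axis with `...` leaves any leading batch dimensions untouched, which is why the same code serves both input shapes.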

We can draw the bounding box in the image to check if it is accurate. Before drawing the box, we
will define a helper function bbox_to_rect. It represents the bounding box in the bounding box
format of matplotlib.


#@save
def bbox_to_rect(bbox, color):
    """Convert bounding box to matplotlib format."""
    # Convert the bounding box (top-left x, top-left y, bottom-right x,
    # bottom-right y) format to matplotlib format: ((upper-left x,
    # upper-left y), width, height)
    return d2l.plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)


After loading the bounding box on the image, we can see that the main outline of the target is 
basically inside the box. 


fig = d2l.plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));


Summary


• In object detection, we not only need to identify all the objects of interest in the image, but
also their positions. The positions are generally represented by a rectangular bounding box.


Exercises 


1. Find some images and try to label a bounding box that contains the target. Compare the 
difference between the time it takes to label the bounding box and label the category. 


Discussions: https://discuss.d2l.ai/t/369


13.4 Anchor Boxes 


Object detection algorithms usually sample a large number of regions in the input image, deter- 
mine whether these regions contain objects of interest, and adjust the edges of the regions so as 
to predict the ground-truth bounding box of the target more accurately. Different models may 
use different region sampling methods. Here, we introduce one such method: it generates mul- 
tiple bounding boxes with different sizes and aspect ratios while centering on each pixel. These 
bounding boxes are called anchor boxes. We will practice object detection based on anchor boxes 
in the following sections. 


First, import the packages or modules required for this section. Here, we have modified the
printing precision of NumPy. Because printing tensors actually calls the print function of NumPy,
the floating-point numbers in tensors printed in this section are more concise.


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon, image, np, npx

np.set_printoptions(2)
npx.set_np()


13.4.1 Generating Multiple Anchor Boxes 


Assume that the input image has a height of $h$ and width of $w$. We generate anchor boxes with
different shapes centered on each pixel of the image. Assume the size is $s \in (0, 1]$, the aspect
ratio is $r > 0$, and the width and height of the anchor box are $ws\sqrt{r}$ and $hs/\sqrt{r}$,
respectively. When the center position is given, an anchor box with known width and height is
determined.


Below we set a set of sizes $s_1, \ldots, s_n$ and a set of aspect ratios $r_1, \ldots, r_m$. If we
use a combination of all sizes and aspect ratios with each pixel as the center, the input image will
have a total of $whnm$ anchor boxes. Although these anchor boxes may cover all ground-truth
bounding boxes, the computational complexity is often excessive. Therefore, we are usually only
interested in combinations containing $s_1$ or $r_1$, that is:


$$(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1). \tag{13.4.1}$$





189 https://discuss.d2l.ai/t/369

That is, the number of anchor boxes centered on the same pixel is $n + m - 1$. For the entire input
image, we will generate a total of $wh(n + m - 1)$ anchor boxes.
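To make the count concrete, the following sketch enumerates the size and aspect-ratio pairs kept for one pixel and compares the total number of anchor boxes with and without the restriction. It assumes three sizes and three ratios and a 561 x 728 image, which are illustrative values chosen for this sketch.

```python
sizes, ratios = [0.75, 0.5, 0.25], [1, 2, 0.5]

# Every size paired with the first ratio, then the first size with the other ratios
pairs = [(s, ratios[0]) for s in sizes] + [(sizes[0], r) for r in ratios[1:]]

n, m = len(sizes), len(ratios)
h, w = 561, 728                   # illustrative image height and width in pixels
full = w * h * n * m              # all size/ratio combinations: 3675672
reduced = w * h * (n + m - 1)     # restricted combinations:     2042040
```

Even in this small case the restriction removes almost half of the boxes, and the saving grows as more sizes and ratios are added.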


The above method of generating anchor boxes has been implemented in the multibox_prior
function. We specify the input, a set of sizes, and a set of aspect ratios, and this function returns
all the anchor boxes generated for that input.


#@save
def multibox_prior(data, sizes, ratios):
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.ctx, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = np.array(sizes, ctx=device)
    ratio_tensor = np.array(ratios, ctx=device)
    # Offsets are required to move the anchor to center of a pixel
    # Since pixel (height=1, width=1), we choose to offset our centers by 0.5
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # Scaled steps in y axis
    steps_w = 1.0 / in_width  # Scaled steps in x axis

    # Generate all center points for the anchor boxes
    center_h = (np.arange(in_height, ctx=device) + offset_h) * steps_h
    center_w = (np.arange(in_width, ctx=device) + offset_w) * steps_w
    shift_x, shift_y = np.meshgrid(center_w, center_h)
    shift_x, shift_y = shift_x.reshape(-1), shift_y.reshape(-1)

    # Generate boxes_per_pixel number of heights and widths which are later
    # used to create anchor box corner coordinates (xmin, ymin, xmax, ymax)
    # concat (various sizes, first ratio) and (first size, various ratios)
    w = np.concatenate((size_tensor * np.sqrt(ratio_tensor[0]),
                        sizes[0] * np.sqrt(ratio_tensor[1:])))\
                        * in_height / in_width  # handle rectangular inputs
    h = np.concatenate((size_tensor / np.sqrt(ratio_tensor[0]),
                        sizes[0] / np.sqrt(ratio_tensor[1:])))
    # Divide by 2 to get half height and half width
    anchor_manipulations = np.tile(np.stack((-w, -h, w, h)).T,
                                   (in_height * in_width, 1)) / 2

    # Each center point will have boxes_per_pixel number of anchor boxes, so
    # generate grid of all anchor box centers with boxes_per_pixel repeats
    out_grid = np.stack([shift_x, shift_y, shift_x, shift_y],
                        axis=1).repeat(boxes_per_pixel, axis=0)

    output = out_grid + anchor_manipulations
    return np.expand_dims(output, axis=0)


We can see that the shape of the returned anchor box variable Y is (batch size, number of anchor
boxes, 4).


img = image.imread('../img/catdog.jpg').asnumpy()
h, w = img.shape[0:2]

print(h, w)
X = np.random.uniform(size=(1, 3, h, w))  # Construct input data
Y = multibox_prior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
Y.shape


561 728


(1, 2042040, 4)


After changing the shape of the anchor box variable Y to (image height, image width, number of
anchor boxes centered on the same pixel, 4), we can obtain all the anchor boxes centered on a
specified pixel position. In the following example, we access the first anchor box centered on
(250, 250). It has four elements: the x, y axis coordinates of the upper-left corner and the x, y axis
coordinates of the lower-right corner of the anchor box. The coordinate values of the x and y axis
are divided by the width and height of the image, respectively, so the value range is between 0 and
1.


boxes = Y.reshape(h, w, 5, 4) 
boxes[250, 250, 0, :] 


array([0.06, 0.07, 0.63, 0.82]) 
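The printed box can be checked by hand. The sketch below repeats, in plain Python, the arithmetic that multibox_prior performs for the first size and ratio at pixel (250, 250), assuming the 561 x 728 image used above. Note that the implementation multiplies the normalized width by in_height / in_width, so a box with r = 1 comes out square in pixel units.

```python
import math

h, w = 561, 728        # image height and width in pixels
row, col = 250, 250    # pixel on which the anchor box is centered
s, r = 0.75, 1.0       # first size and first aspect ratio

# Normalized center: pixel index plus the 0.5 offset, divided by the image extent
cx, cy = (col + 0.5) / w, (row + 0.5) / h

# Normalized width and height as computed inside multibox_prior
bw = s * math.sqrt(r) * h / w
bh = s / math.sqrt(r)

box = [cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2]
print([round(v, 2) for v in box])  # [0.06, 0.07, 0.63, 0.82]
```

Multiplying the normalized width by w and the normalized height by h gives 420.75 pixels in both directions, which confirms the square shape.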


In order to describe all anchor boxes centered on one pixel in the image, we first define the 
show_bboxes function to draw multiple bounding boxes on the image. 


#@save
def show_bboxes(axes, bboxes, labels=None, colors=None):
    """Show bounding boxes."""
    def _make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj

    labels = _make_list(labels)
    colors = _make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = d2l.bbox_to_rect(bbox.asnumpy(), color)
        axes.add_patch(rect)
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=9, color=text_color,
                      bbox=dict(facecolor=color, lw=0))


As we just saw, the coordinate values of the x and y axis in the variable boxes have been divided 
by the width and height of the image, respectively. When drawing images, we need to restore 
the original coordinate values of the anchor boxes and therefore define the variable bbox_scale. 
Now, we can draw all the anchor boxes centered on (250, 250) in the image. As you can see, the 
blue anchor box with a size of 0.75 and an aspect ratio of 1 covers the dog in the image well.

d2l.set_figsize()
bbox_scale = np.array((w, h, w, h))
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, boxes[250, 250, :, :] * bbox_scale,
            ['s=0.75, r=1', 's=0.5, r=1', 's=0.25, r=1', 's=0.75, r=2',
             's=0.75, r=0.5'])







13.4.2 Intersection over Union 


We just mentioned that the anchor box covers the dog in the image well. If the ground-truth 
bounding box of the target is known, how can “well” here be quantified? An intuitive method 
is to measure the similarity between anchor boxes and the ground-truth bounding box. We know 
that the Jaccard index can measure the similarity between two sets. Given sets A and B, their 
Jaccard index is the size of their intersection divided by the size of their union: 

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \tag{13.4.2}$$





In fact, we can consider the pixel area of a bounding box as a collection of pixels. In this way,
we can measure the similarity of the two bounding boxes by the Jaccard index of their pixel sets.
When we measure the similarity of two bounding boxes, we usually refer to the Jaccard index as
intersection over union (IoU), which is the ratio of the intersecting area to the union area of the
two bounding boxes, as shown in Fig. 13.4.1. The value range of IoU is between 0 and 1: 0 means
that there are no overlapping pixels between the two bounding boxes, while 1 indicates that the
two bounding boxes are equal.


Fig. 13.4.1: IoU is the ratio of the intersecting area to the union area of two bounding boxes.

For the remainder of this section, we will use IoU to measure the similarity between anchor boxes
and ground-truth bounding boxes, and between different anchor boxes.


#@save
def box_iou(boxes1, boxes2):
    """Compute IOU between two sets of boxes of shape (N,4) and (M,4)."""
    # Compute box areas
    box_area = lambda boxes: ((boxes[:, 2] - boxes[:, 0]) *
                              (boxes[:, 3] - boxes[:, 1]))
    area1 = box_area(boxes1)
    area2 = box_area(boxes2)
    lt = np.maximum(boxes1[:, None, :2], boxes2[:, :2])  # [N,M,2]
    rb = np.minimum(boxes1[:, None, 2:], boxes2[:, 2:])  # [N,M,2]
    wh = (rb - lt).clip(min=0)  # [N,M,2]
    inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]
    union = area1[:, None] + area2 - inter
    return inter / union
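A small numeric check helps build intuition for the values IoU takes. The sketch below mirrors the pairwise computation in plain NumPy (an assumption made so that it runs without MXNet; the name box_iou_np is hypothetical) and evaluates three easy cases: partial overlap, identical boxes, and disjoint boxes.

```python
import numpy as np

def box_iou_np(boxes1, boxes2):
    """Pairwise IoU of corner-format boxes with shapes (N, 4) and (M, 4)."""
    boxes1 = np.asarray(boxes1, dtype=float)
    boxes2 = np.asarray(boxes2, dtype=float)
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    # Broadcasting boxes1 over a new axis yields every (anchor, box) pair
    lt = np.maximum(boxes1[:, None, :2], boxes2[:, :2])  # upper-left of overlap
    rb = np.minimum(boxes1[:, None, 2:], boxes2[:, 2:])  # lower-right of overlap
    wh = np.clip(rb - lt, 0, None)                       # zero when disjoint
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area1[:, None] + area2 - inter)

a = [[0.0, 0.0, 2.0, 2.0]]
b = [[1.0, 1.0, 3.0, 3.0],   # overlaps a in a 1 x 1 square
     [0.0, 0.0, 2.0, 2.0],   # identical to a
     [5.0, 5.0, 6.0, 6.0]]   # disjoint from a
iou = box_iou_np(a, b)       # approximately [[1/7, 1, 0]]
```

In the first case the intersection has area 1 and the union has area 4 + 4 - 1 = 7, so the IoU is 1/7.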


13.4.3 Labeling Training Set Anchor Boxes 


In the training set, we consider each anchor box as a training example. In order to train the object 
detection model, we need to mark two types of labels for each anchor box: first, the category of the 
target contained in the anchor box (category) and, second, the offset of the ground-truth bounding 
box relative to the anchor box (offset). In object detection, we first generate multiple anchor boxes, 
predict the categories and offsets for each anchor box, adjust the anchor box position according 
to the predicted offset to obtain the bounding boxes to be used for prediction, and finally filter out 
the prediction bounding boxes that need to be output. 


We know that, in the object detection training set, each image is labelled with the location of the 
ground-truth bounding box and the category of the target contained. After the anchor boxes are 
generated, we primarily label anchor boxes based on the location and category information of 
the ground-truth bounding boxes similar to the anchor boxes. So how do we assign ground-truth 
bounding boxes to anchor boxes similar to them? 


Assume that the anchor boxes in the image are $A_1, A_2, \ldots, A_{n_a}$ and the ground-truth
bounding boxes are $B_1, B_2, \ldots, B_{n_b}$, with $n_a \geq n_b$. Define matrix
$\mathbf{X} \in \mathbb{R}^{n_a \times n_b}$, where element $x_{ij}$ in the $i^\mathrm{th}$ row
and $j^\mathrm{th}$ column is the IoU of the anchor box $A_i$ to the ground-truth bounding box
$B_j$. First, we find the largest element in the matrix $\mathbf{X}$ and record the row index and
column index of the element as $i_1, j_1$. We assign the ground-truth bounding box $B_{j_1}$ to
the anchor box $A_{i_1}$. Obviously, anchor box $A_{i_1}$ and ground-truth bounding box
$B_{j_1}$ have the highest similarity among all the “anchor box-ground-truth bounding box”
pairings. Next, discard all elements in the $i_1^\mathrm{th}$ row and the $j_1^\mathrm{th}$
column in the matrix $\mathbf{X}$. Find the largest remaining element in the matrix $\mathbf{X}$
and record the row index and column index of the element as $i_2, j_2$. We assign ground-truth
bounding box $B_{j_2}$ to anchor box $A_{i_2}$ and then discard all elements in the
$i_2^\mathrm{th}$ row and the $j_2^\mathrm{th}$ column in the matrix $\mathbf{X}$. At this
point, elements in two rows and two columns in the matrix $\mathbf{X}$ have been discarded.


We proceed until all elements in the $n_b$ columns of the matrix $\mathbf{X}$ are discarded. At
this time, we have assigned a ground-truth bounding box to each of $n_b$ anchor boxes. Next, we
only traverse the remaining $n_a - n_b$ anchor boxes. Given anchor box $A_i$, find the bounding
box $B_j$ with the largest IoU with $A_i$ according to the $i^\mathrm{th}$ row of the matrix
$\mathbf{X}$, and only assign ground-truth bounding box $B_j$ to anchor box $A_i$ when the IoU
is greater than the predetermined threshold.


As shown in Fig. 13.4.2 (left), assuming that the maximum value in the matrix $\mathbf{X}$ is
$x_{23}$, we will assign ground-truth bounding box $B_3$ to anchor box $A_2$. Then, we discard
all the elements in row 2 and column 3 of the matrix, find the largest element $x_{71}$ of the
remaining shaded area, and assign ground-truth bounding box $B_1$ to anchor box $A_7$. Then,
as shown in Fig. 13.4.2 (middle), discard all the elements in row 7 and column 1 of the matrix,
find the largest element $x_{54}$ of the remaining shaded area, and assign ground-truth bounding
box $B_4$ to anchor box $A_5$. Finally, as shown in Fig. 13.4.2 (right), discard all the elements
in row 5 and column 4 of the matrix, find the largest element $x_{92}$ of the remaining shaded
area, and assign ground-truth bounding box $B_2$ to anchor box $A_9$. After that, we only need
to traverse the remaining anchor boxes $A_1, A_3, A_4, A_6, A_8$ and determine whether to
assign ground-truth bounding boxes to them according to the threshold.


Fig. 13.4.2: Assign ground-truth bounding boxes to anchor boxes. (Rows: anchor box index;
columns: ground-truth bounding box index.)


#@save
def match_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign ground-truth bounding boxes to anchor boxes similar to them."""
    num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]
    # Element `x_ij` in the `i^th` row and `j^th` column is the IoU
    # of the anchor box `anc_i` to the ground-truth bounding box `box_j`
    jaccard = box_iou(anchors, ground_truth)
    # Initialize the tensor to hold assigned ground truth bbox for each anchor
    anchors_bbox_map = np.full((num_anchors,), -1, dtype=np.int32, ctx=device)
    # Assign ground truth bounding box according to the threshold
    # (use the iou_threshold argument rather than a hard-coded 0.5)
    max_ious, indices = np.max(jaccard, axis=1), np.argmax(jaccard, axis=1)
    anc_i = np.nonzero(max_ious >= iou_threshold)[0]
    box_j = indices[max_ious >= iou_threshold]
    anchors_bbox_map[anc_i] = box_j
    # Find the largest iou for each bbox
    col_discard = np.full((num_anchors,), -1)
    row_discard = np.full((num_gt_boxes,), -1)
    for _ in range(num_gt_boxes):
        max_idx = np.argmax(jaccard)
        box_idx = (max_idx % num_gt_boxes).astype('int32')
        anc_idx = (max_idx / num_gt_boxes).astype('int32')
        anchors_bbox_map[anc_idx] = box_idx
        jaccard[:, box_idx] = col_discard
        jaccard[anc_idx, :] = row_discard
    return anchors_bbox_map
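The assignment procedure can be traced on a small matrix. The following is a minimal sketch in plain NumPy that follows the order of the prose description (greedy global maxima first, then thresholding of the remaining anchors); the function name assign_anchors and the IoU values are hypothetical, and the sketch assumes there are at least as many anchors as ground-truth boxes.

```python
import numpy as np

def assign_anchors(iou, threshold=0.5):
    """Return, per anchor, the index of its assigned ground-truth box or -1."""
    iou = np.array(iou, dtype=float)
    num_anchors, num_gt = iou.shape
    assignment = np.full(num_anchors, -1, dtype=int)
    work = iou.copy()
    # Step 1: take the global maximum, then discard its row and column
    for _ in range(num_gt):
        i, j = np.unravel_index(np.argmax(work), work.shape)
        assignment[i] = j
        work[i, :] = -1
        work[:, j] = -1
    # Step 2: each remaining anchor takes its best box only if the IoU
    # reaches the threshold; otherwise it stays unassigned (background)
    for i in np.nonzero(assignment < 0)[0]:
        j = int(np.argmax(iou[i]))
        if iou[i, j] >= threshold:
            assignment[i] = j
    return assignment

# Rows are five anchors, columns are two ground-truth boxes (illustrative values)
iou = [[0.05, 0.00],
       [0.14, 0.00],
       [0.00, 0.57],
       [0.00, 0.21],
       [0.00, 0.75]]
result = assign_anchors(iou)  # array([-1, 0, 1, -1, 1])
```

Anchor 1 receives box 0 even though its IoU of 0.14 is far below the threshold, because step 1 guarantees that every ground-truth box is matched to at least one anchor.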


Now we can label the categories and offsets of the anchor boxes. If an anchor box $A$ is assigned
ground-truth bounding box $B$, the category of the anchor box $A$ is set to the category of $B$,
and the offset of the anchor box $A$ is set according to the relative position of the central
coordinates of $B$ and $A$ and the relative sizes of the two boxes. Because the positions and sizes
of various boxes in the dataset may vary, these relative positions and relative sizes usually require
some special transformations to make the offset distribution more uniform and easier to fit.
Assume the center coordinates of anchor box $A$ and its assigned ground-truth bounding box $B$
are $(x_a, y_a)$ and $(x_b, y_b)$, the widths of $A$ and $B$ are $w_a, w_b$, and their heights are
$h_a, h_b$, respectively. In this case, a common technique is to label the offset of $A$ as


$$\left( \frac{\frac{x_b - x_a}{w_a} - \mu_x}{\sigma_x}, \frac{\frac{y_b - y_a}{h_a} - \mu_y}{\sigma_y}, \frac{\log \frac{w_b}{w_a} - \mu_w}{\sigma_w}, \frac{\log \frac{h_b}{h_a} - \mu_h}{\sigma_h} \right). \tag{13.4.3}$$


The default values of the constants are $\mu_x = \mu_y = \mu_w = \mu_h = 0$,
$\sigma_x = \sigma_y = 0.1$, and $\sigma_w = \sigma_h = 0.2$. This transformation is implemented
below in the offset_boxes function. If an anchor box is not assigned a ground-truth bounding box,
we only need to set the category of the anchor box to background. Anchor boxes whose category is
background are often referred to as negative anchor boxes, and the rest are referred to as positive
anchor boxes.


#@save
def offset_boxes(anchors, assigned_bb, eps=1e-6):
    c_anc = d2l.box_corner_to_center(anchors)
    c_assigned_bb = d2l.box_corner_to_center(assigned_bb)
    offset_xy = 10 * (c_assigned_bb[:, :2] - c_anc[:, :2]) / c_anc[:, 2:]
    offset_wh = 5 * np.log(eps + c_assigned_bb[:, 2:] / c_anc[:, 2:])
    offset = np.concatenate([offset_xy, offset_wh], axis=1)
    return offset
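With the default constants, dividing by 0.1 and 0.2 becomes multiplying by 10 and 5, which is where the factors in offset_boxes come from. As a worked check in plain NumPy (an assumption made so the sketch runs without MXNet; encode_offsets is a hypothetical name), the code below computes the four offsets for the anchor box [0.15, 0.2, 0.4, 0.4] and the dog's ground-truth box [0.1, 0.08, 0.52, 0.92] from the example later in this section.

```python
import numpy as np

def encode_offsets(anchor, gt, eps=1e-6):
    """Offsets of one ground-truth box relative to one anchor (corner format)."""
    ax, ay = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    return np.array([10 * (gx - ax) / aw,          # (x_b - x_a) / w_a / 0.1
                     10 * (gy - ay) / ah,          # (y_b - y_a) / h_a / 0.1
                     5 * np.log(eps + gw / aw),    # log(w_b / w_a) / 0.2
                     5 * np.log(eps + gh / ah)])   # log(h_b / h_a) / 0.2

anchor = [0.15, 0.2, 0.4, 0.4]
dog = [0.1, 0.08, 0.52, 0.92]
offsets = encode_offsets(anchor, dog)  # approximately (1.40, 10.00, 2.59, 7.18)
```

These values agree with the entries 1.40e+00, 1.00e+01, 2.59e+00, 7.18e+00 that the labeling example in this section prints for the same anchor box.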


#@save
def multibox_target(anchors, labels):
    batch_size, anchors = labels.shape[0], anchors.squeeze(0)
    batch_offset, batch_mask, batch_class_labels = [], [], []
    device, num_anchors = anchors.ctx, anchors.shape[0]
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = match_anchor_to_bbox(label[:, 1:], anchors, device)
        bbox_mask = np.tile((np.expand_dims((anchors_bbox_map >= 0),
                                            axis=-1)), (1, 4)).astype('int32')
        # Initialize class_labels and assigned bbox coordinates with zeros
        class_labels = np.zeros(num_anchors, dtype=np.int32, ctx=device)
        assigned_bb = np.zeros((num_anchors, 4), dtype=np.float32, ctx=device)
        # Assign class labels to the anchor boxes using matched gt bbox labels
        # If no gt bbox is assigned to an anchor box, then let the
        # class_labels and assigned_bb remain zero, i.e the background class
        indices_true = np.nonzero(anchors_bbox_map >= 0)[0]
        bb_idx = anchors_bbox_map[indices_true]
        class_labels[indices_true] = label[bb_idx, 0].astype('int32') + 1
        assigned_bb[indices_true] = label[bb_idx, 1:]
        # offset transformations
        offset = offset_boxes(anchors, assigned_bb) * bbox_mask
        batch_offset.append(offset.reshape(-1))
        batch_mask.append(bbox_mask.reshape(-1))
        batch_class_labels.append(class_labels)
    bbox_offset = np.stack(batch_offset)
    bbox_mask = np.stack(batch_mask)
    class_labels = np.stack(batch_class_labels)
    return (bbox_offset, bbox_mask, class_labels)


Below we demonstrate a detailed example. We define ground-truth bounding boxes for the cat and
dog in the image we loaded, where the first element is the category (0 for dog, 1 for cat) and the
remaining four elements are the x, y axis coordinates of the upper-left corner and the x, y axis
coordinates of the lower-right corner (the value range is between 0 and 1). Here, we construct five
anchor boxes to be labeled by the coordinates of the upper-left corner and the lower-right corner,
which are recorded as $A_0, \ldots, A_4$, respectively (the index in the program starts from 0).
First, draw the positions of these anchor boxes and the ground-truth bounding boxes in the image.


ground_truth = np.array([[0, 0.1, 0.08, 0.52, 0.92],
                         [1, 0.55, 0.2, 0.9, 0.88]])
anchors = np.array([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                    [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                    [0.57, 0.3, 0.92, 0.9]])

fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, ground_truth[:, 1:] * bbox_scale, ['dog', 'cat'], 'k')
show_bboxes(fig.axes, anchors * bbox_scale, ['0', '1', '2', '3', '4']);





We can label categories and offsets for anchor boxes by using the multibox_target function. This 
function sets the background category to 0 and increments the integer index of the target category 
from zero by 1 (1 for dog and 2 for cat). 


We add an example dimension to the anchor boxes and ground-truth bounding boxes by using the
expand_dims function and pass both to multibox_target.

labels = multibox_target(np.expand_dims(anchors, axis=0),
                         np.expand_dims(ground_truth, axis=0))


There are three items in the returned result, all of which are in the tensor format. The third item
contains the category labeled for each anchor box.


labels[2] 


array([[0, 1, 2, 0, 2]], dtype=int32)


We analyze these labelled categories based on positions of anchor boxes and ground-truth
bounding boxes in the image. First, in all “anchor box-ground-truth bounding box” pairs, the IoU
of anchor box $A_4$ to the ground-truth bounding box of the cat is the largest, so the category of
anchor box $A_4$ is labeled as cat. Without considering anchor box $A_4$ or the ground-truth
bounding box of the cat, in the remaining “anchor box-ground-truth bounding box” pairs, the pair
with the largest IoU is anchor box $A_1$ and the ground-truth bounding box of the dog, so the
category of anchor box $A_1$ is labeled as dog. Next, traverse the remaining three unlabeled
anchor boxes. The category of the ground-truth bounding box with the largest IoU with anchor box
$A_0$ is dog, but the IoU is smaller than the threshold (the default is 0.5), so the category is
labeled as background; the category of the ground-truth bounding box with the largest IoU with
anchor box $A_2$ is cat and the IoU is greater than the threshold, so the category is labeled as
cat; the category of the ground-truth bounding box with the largest IoU with anchor box $A_3$ is
cat, but the IoU is smaller than the threshold, so the category is labeled as background.


The second item of the return value is a mask variable, with the shape of (batch size, four times
the number of anchor boxes). The elements in the mask variable correspond one-to-one with the
four offset values of each anchor box. Because we do not care about background detection, offsets
of the negative class should not affect the objective function. Through elementwise multiplication,
the zeros in the mask variable filter out negative class offsets before the objective function is
calculated.


labels[1] 


array([[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]],
      dtype=int32)
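To illustrate how such a mask enters a loss, the sketch below applies it to an L1 offset loss in plain NumPy. The choice of L1 loss, the normalization by the number of unmasked entries, and the constant prediction values are assumptions made only for this illustration.

```python
import numpy as np

# One flag per anchor (background anchors get 0), repeated for the 4 offsets
anchor_is_positive = np.array([0, 1, 1, 0, 1])
mask = np.repeat(anchor_is_positive, 4)          # shape (20,)

pred = np.full(20, 0.5)     # hypothetical predicted offsets
target = np.zeros(20)       # hypothetical labeled offsets

# Elementwise multiplication zeroes the contribution of background anchors
l1 = np.abs(pred - target) * mask
loss = l1.sum() / mask.sum()   # average over the 12 unmasked offsets: 0.5
```

Without the mask, the eight offsets of the two background anchors would contribute to the loss even though they carry no meaningful target.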


The first item returned is the four offset values labeled for each anchor box, with the offsets of 
negative class anchor boxes labeled as 0. 


labels[0] 


array([[-0.00e+00, -0.00e+00, -0.00e+00, -0.00e+00,  1.40e+00,  1.00e+01,
         2.59e+00,  7.18e+00, -1.20e+00,  2.69e-01,  1.68e+00, -1.57e+00,
        -0.00e+00, -0.00e+00, -0.00e+00, -0.00e+00, -5.71e-01, -1.00e+00,
         4.17e-06,  6.26e-01]])


13.4.4 Bounding Boxes for Prediction


During the model prediction phase, we first generate multiple anchor boxes for the image and then
predict categories and offsets for these anchor boxes one by one. Then, we obtain prediction
bounding boxes based on anchor boxes and their predicted offsets.


Below we implement the function offset_inverse, which takes anchors and offset predictions as
inputs and applies the inverse offset transformation to return the predicted bounding box
coordinates.


#@save
def offset_inverse(anchors, offset_preds):
    c_anc = d2l.box_corner_to_center(anchors)
    c_pred_bb_xy = (offset_preds[:, :2] * c_anc[:, 2:] / 10) + c_anc[:, :2]
    c_pred_bb_wh = np.exp(offset_preds[:, 2:] / 5) * c_anc[:, 2:]
    c_pred_bb = np.concatenate((c_pred_bb_xy, c_pred_bb_wh), axis=1)
    predicted_bb = d2l.box_center_to_corner(c_pred_bb)
    return predicted_bb
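Since offset_inverse is meant to undo the labeling transformation, encoding a box and decoding the result should return the original box up to the small eps used in the logarithm. The following round-trip check is a minimal sketch in plain NumPy; the helper names to_center, to_corner, encode, and decode are hypothetical stand-ins for the d2l functions.

```python
import numpy as np

def to_center(b):
    """(x1, y1, x2, y2) -> (cx, cy, w, h) for an (N, 4) array."""
    return np.stack(((b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2,
                     b[:, 2] - b[:, 0], b[:, 3] - b[:, 1]), axis=1)

def to_corner(c):
    """(cx, cy, w, h) -> (x1, y1, x2, y2) for an (N, 4) array."""
    return np.stack((c[:, 0] - c[:, 2] / 2, c[:, 1] - c[:, 3] / 2,
                     c[:, 0] + c[:, 2] / 2, c[:, 1] + c[:, 3] / 2), axis=1)

def encode(anchors, boxes, eps=1e-6):
    """Label offsets of boxes relative to anchors (default constants)."""
    a, b = to_center(anchors), to_center(boxes)
    xy = 10 * (b[:, :2] - a[:, :2]) / a[:, 2:]
    wh = 5 * np.log(eps + b[:, 2:] / a[:, 2:])
    return np.concatenate((xy, wh), axis=1)

def decode(anchors, offsets):
    """Invert encode: recover corner-format boxes from anchors and offsets."""
    a = to_center(anchors)
    xy = offsets[:, :2] * a[:, 2:] / 10 + a[:, :2]
    wh = np.exp(offsets[:, 2:] / 5) * a[:, 2:]
    return to_corner(np.concatenate((xy, wh), axis=1))

anchors = np.array([[0.15, 0.2, 0.4, 0.4], [0.57, 0.3, 0.92, 0.9]])
targets = np.array([[0.1, 0.08, 0.52, 0.92], [0.55, 0.2, 0.9, 0.88]])
recovered = decode(anchors, encode(anchors, targets))
max_error = np.abs(recovered - targets).max()  # on the order of eps
```

The residual error comes only from the eps added inside the logarithm, so it scales with eps times the anchor width or height.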





When there are many anchor boxes, many similar prediction bounding boxes may be output for 
the same target. To simplify the results, we can remove similar prediction bounding boxes. A 
commonly used method is called non-maximum suppression (NMS). 


Let us take a look at how NMS works. For a prediction bounding box $B$, the model calculates
the predicted probability for each category. Assume the largest predicted probability is $p$; the
category corresponding to this probability is the predicted category of $B$. We also refer to $p$ as
the confidence level of prediction bounding box $B$. On the same image, we sort the prediction
bounding boxes with predicted categories other than background by confidence level from high
to low, and obtain the list $L$. Select the prediction bounding box $B_1$ with the highest
confidence level from $L$ as a baseline and remove from $L$ all non-baseline prediction bounding
boxes whose IoU with $B_1$ is greater than a certain threshold. The threshold here is a preset
hyperparameter. At this point, $L$ retains the prediction bounding box with the highest confidence
level and removes other prediction bounding boxes similar to it. Next, select the prediction
bounding box $B_2$ with the second highest confidence level from $L$ as a baseline, and remove
from $L$ all non-baseline prediction bounding boxes whose IoU with $B_2$ is greater than the
threshold. Repeat this process until all prediction bounding boxes in $L$ have been used as a
baseline. At this time, the IoU of any pair of prediction bounding boxes in $L$ is no greater than
the threshold. Finally, output all prediction bounding boxes in the list $L$.


#@save
def nms(boxes, scores, iou_threshold):
    # sorting scores by the descending order and return their indices
    B = scores.argsort()[::-1]
    keep = []  # boxes indices that will be kept
    while B.size > 0:
        i = B[0]
        keep.append(i)
        if B.size == 1: break
        iou = box_iou(boxes[i, :].reshape(-1, 4),
                      boxes[B[1:], :].reshape(-1, 4)).reshape(-1)
        inds = np.nonzero(iou <= iou_threshold)[0]
        B = B[inds + 1]
    return np.array(keep, dtype=np.int32, ctx=boxes.ctx)
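The loop can be traced on four boxes. The following is a minimal sketch of the same procedure in plain NumPy (an assumption made so it runs without MXNet; nms_np and iou_one_to_many are hypothetical names). It uses a stable sort, so boxes with equal scores are visited in index order, which may differ from the tie-breaking of the MXNet implementation.

```python
import numpy as np

def iou_one_to_many(box, others):
    """IoU of one corner-format box against an (M, 4) array of boxes."""
    lt = np.maximum(box[:2], others[:, :2])
    rb = np.minimum(box[2:], others[:, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[:, 0] * wh[:, 1]
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_others = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area_box + area_others - inter)

def nms_np(boxes, scores, iou_threshold):
    """Return the indices of the boxes kept by non-maximum suppression."""
    order = np.argsort(-scores, kind="stable")  # descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop every remaining box that overlaps the baseline too much
        ious = iou_one_to_many(boxes[i], boxes[rest])
        order = rest[ious <= iou_threshold]
    return keep

boxes = np.array([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
                  [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
scores = np.array([0.9, 0.8, 0.7, 0.9])
kept = nms_np(boxes, scores, iou_threshold=0.5)  # boxes 0 and 3 survive
```

Boxes 1 and 2 overlap box 0 with an IoU of roughly 0.74 and 0.55, so both are suppressed, while box 3 barely touches the others and is kept.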


#@save
def multibox_detection(cls_probs, offset_preds, anchors, nms_threshold=0.5,
                       pos_threshold=0.00999999978):
    device, batch_size = cls_probs.ctx, cls_probs.shape[0]
    anchors = np.squeeze(anchors, axis=0)
    num_classes, num_anchors = cls_probs.shape[1], cls_probs.shape[2]
    out = []
    for i in range(batch_size):
        cls_prob, offset_pred = cls_probs[i], offset_preds[i].reshape(-1, 4)
        conf, class_id = np.max(cls_prob[1:], 0), np.argmax(cls_prob[1:], 0)
        predicted_bb = offset_inverse(anchors, offset_pred)
        # Use the nms_threshold argument rather than a hard-coded 0.5
        keep = nms(predicted_bb, conf, nms_threshold)
        # Find all non_keep indices and set the class_id to background
        all_idx = np.arange(num_anchors, dtype=np.int32, ctx=device)
        combined = np.concatenate((keep, all_idx))
        unique, counts = np.unique(combined, return_counts=True)
        non_keep = unique[counts == 1]
        all_id_sorted = np.concatenate((keep, non_keep))
        class_id[non_keep] = -1
        class_id = class_id[all_id_sorted].astype('float32')
        conf, predicted_bb = conf[all_id_sorted], predicted_bb[all_id_sorted]
        # threshold to be a positive prediction
        below_min_idx = (conf < pos_threshold)
        class_id[below_min_idx] = -1
        conf[below_min_idx] = 1 - conf[below_min_idx]
        pred_info = np.concatenate((np.expand_dims(class_id, axis=1),
                                    np.expand_dims(conf, axis=1),
                                    predicted_bb), axis=1)
        out.append(pred_info)
    return np.stack(out)


Next, we will look at a detailed example. First, construct four anchor boxes. For the sake of sim- 
plicity, we assume that predicted offsets are all 0. This means that the prediction bounding boxes 
are anchor boxes. Finally, we construct a predicted probability for each category. 


anchors = np.array([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
                    [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
offset_preds = np.array([0] * anchors.size)
cls_probs = np.array([[0] * 4,  # Predicted probability for background
                      [0.9, 0.8, 0.7, 0.1],  # Predicted probability for dog
                      [0.1, 0.2, 0.3, 0.9]])  # Predicted probability for cat


Plot the prediction bounding boxes and their confidence levels on the image.


fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, anchors * bbox_scale,
            ['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9'])

We use the multibox_detection function to perform NMS and set the threshold to 0.5. We add an
example dimension to each tensor input. We can see that the shape of the returned result is (batch
size, number of anchor boxes, 6). The 6 elements of each row represent the output information
for the same prediction bounding box. The first element is the predicted category index, which
starts from 0 (0 is dog, 1 is cat). The value -1 indicates background or removal in NMS. The second
element is the confidence level of the prediction bounding box. The remaining four elements are
the x, y axis coordinates of the upper-left corner and the x, y axis coordinates of the lower-right
corner of the prediction bounding box (the value range is between 0 and 1).


output = multibox_detection( 
np.expand_dims(cls_probs, axis=0), 
np.expand_dims(offset_preds, axis=0), 
np.expand_dims(anchors, axis=0), 
nms_threshold=0.5) 


output 

array([[[ 1.  ,  0.9 ,  0.55,  0.2 ,  0.9 ,  0.88],
        [ 0.  ,  0.9 ,  0.1 ,  0.08,  0.52,  0.92],
        [-1.  ,  0.8 ,  0.08,  0.2 ,  0.56,  0.95],
        [-1.  ,  0.7 ,  0.15,  0.3 ,  0.62,  0.91]]])


We remove the prediction bounding boxes of category -1 and visualize the results retained by NMS. 


fig = d2l.plt.imshow(img)
for i in output[0].asnumpy():
    if i[0] == -1:
        continue
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, [np.array(i[2:]) * bbox_scale], label)

In practice, we can remove prediction bounding boxes with lower confidence levels before
performing NMS, thereby reducing the amount of computation for NMS. We can also filter the output
of NMS, for example, by only retaining results with higher confidence levels as the final output.
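
To make the procedure concrete, the following is a minimal pure-Python sketch of greedy NMS with an optional confidence pre-filter. It is not the multibox_detection implementation used above; the helper names iou and nms and the min_score parameter are assumptions introduced only for illustration. Like the example above, it ignores categories when suppressing boxes.

```python
def iou(a, b):
    """IoU of two boxes given in (x1, y1, x2, y2) corner format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5, min_score=0.0):
    """Return indices of the boxes kept by greedy NMS.

    Boxes scoring below `min_score` (a hypothetical pre-filter) are dropped
    before any IoU is computed, which reduces the work done by NMS.
    """
    candidates = [i for i, s in enumerate(scores) if s >= min_score]
    # Visit candidates from the highest to the lowest confidence
    order = sorted(candidates, key=lambda i: -scores[i])
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


anchors = [[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
           [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]]
scores = [0.9, 0.8, 0.7, 0.9]  # highest class probability of each box
kept = nms(anchors, scores, iou_threshold=0.5)
kept_prefiltered = nms(anchors, scores, iou_threshold=0.5, min_score=0.75)
print(kept, kept_prefiltered)
```

With the four anchor boxes of the example, the two lower-confidence dog boxes overlap the best dog box by more than 0.5 and are suppressed, so only the first and last boxes survive with or without the pre-filter.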


Summary 
• We generate multiple anchor boxes with different sizes and aspect ratios, centered on each
pixel.

• IoU, also called Jaccard index, measures the similarity of two bounding boxes. It is the ratio
of the intersecting area to the union area of two bounding boxes.

• In the training set, we mark two types of labels for each anchor box: one is the category of the
target contained in the anchor box and the other is the offset of the ground-truth bounding
box relative to the anchor box.

• When predicting, we can use non-maximum suppression (NMS) to remove similar prediction
bounding boxes, thereby simplifying the results.


Exercises 
1. Change the sizes and ratios values in the multibox_prior function and observe the changes 
to the generated anchor boxes. 
2. Construct two bounding boxes with an IoU of 0.5, and observe their coincidence. 


3. Verify the output of offset labels[0] by marking the anchor box offsets as defined in this
section (the constant is the default value).


4. Modify the variable anchors in the “Labeling Training Set Anchor Boxes” and “Output Bound- 
ing Boxes for Prediction” sections. How do the results change? 


Discussions: https://discuss.d2l.ai/t/370



13.5 Multiscale Object Detection 


In Section 13.4, we generated multiple anchor boxes centered on each pixel of the input image. 
These anchor boxes are used to sample different regions of the input image. However, if anchor 
boxes are generated centered on each pixel of the image, soon there will be too many anchor boxes 
for us to compute. For example, we assume that the input image has a height and a width of 561 
and 728 pixels respectively. If five different shapes of anchor boxes are generated centered on 
each pixel, over two million anchor boxes (561 x 728 x 5) need to be predicted and labeled on the 
image. 
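
The arithmetic above is easy to check. The sketch below compares the dense count with the count obtained when anchor boxes are generated only on a few coarse grids; the function names and the grid shapes (4 x 4, 2 x 2, 1 x 1 with three boxes per position) are illustrative assumptions, not part of any library.

```python
def dense_anchor_count(height, width, anchors_per_pixel):
    """Anchor boxes when every pixel is used as a center."""
    return height * width * anchors_per_pixel


def sampled_anchor_count(grid_shapes, anchors_per_position):
    """Anchor boxes when centers are sampled on a few coarse grids."""
    return sum(h * w for h, w in grid_shapes) * anchors_per_position


dense = dense_anchor_count(561, 728, 5)
sampled = sampled_anchor_count([(4, 4), (2, 2), (1, 1)], 3)
print(dense, sampled)
```

Sampling centers on coarse grids reduces the count by several orders of magnitude, which is what makes labeling and prediction tractable.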


It is not difficult to reduce the number of anchor boxes. An easy way is to apply uniform sampling
on a small portion of pixels from the input image and generate anchor boxes centered on the
sampled pixels. In addition, we can generate anchor boxes of varied numbers and sizes on multiple
scales. Notice that smaller objects have more possible positions on the image than larger ones.
Here, we will use a simple example: Objects with shapes of 1 x 1, 1 x 2, and 2 x 2 may have 4, 2,
and 1 possible position(s) on an image with the shape 2 x 2. Therefore, when using smaller anchor
boxes to detect smaller objects, we can sample more regions; when using larger anchor boxes to
detect larger objects, we can sample fewer regions.
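
The position counts in this example follow from a simple sliding-window argument: an object of shape h x w fits into an image of shape H x W at (H - h + 1)(W - w + 1) positions. A small sketch (the function name num_positions is an assumption made for illustration):

```python
def num_positions(image_shape, object_shape):
    """Number of distinct placements of an object inside an image."""
    (img_h, img_w), (obj_h, obj_w) = image_shape, object_shape
    return max(img_h - obj_h + 1, 0) * max(img_w - obj_w + 1, 0)


counts = [num_positions((2, 2), shape) for shape in [(1, 1), (1, 2), (2, 2)]]
print(counts)
```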


To demonstrate how to generate anchor boxes on multiple scales, let us read an image first. It has 
a height and width of 561 x 728 pixels. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import image, np, npx

npx.set_np()

img = image.imread('../img/catdog.jpg')
h, w = img.shape[0:2]
h, w


(561, 728) 


In Section 6.2, the 2D array output of the convolutional neural network (CNN) is called a feature 
map. We can determine the midpoints of anchor boxes uniformly sampled on any image by defin- 
ing the shape of the feature map. 


The function display_anchors is defined below. We are going to generate anchor boxes anchors 
centered on each unit (pixel) on the feature map fmap. Since the coordinates of axes x and y in 
anchor boxes anchors have been divided by the width and height of the feature map fmap, values 
between 0 and 1 can be used to represent relative positions of anchor boxes in the feature map. 
Since the midpoints of anchor boxes anchors overlap with all the units on feature map fmap, the 
relative spatial positions of the midpoints of the anchors on any image must have a uniform dis- 
tribution. Specifically, when the width and height of the feature map are set to fmap_w and fmap_h 
respectively, the function will conduct uniform sampling for fmap_h rows and fmap_w columns of 
pixels and use them as midpoints to generate anchor boxes with size s (we assume that the length 
of list s is 1) and different aspect ratios (ratios). 


def display_anchors(fmap_w, fmap_h, s):
    d2l.set_figsize()
    # The values from the first two dimensions will not affect the output
    fmap = np.zeros((1, 10, fmap_h, fmap_w))
    anchors = npx.multibox_prior(fmap, sizes=s, ratios=[1, 2, 0.5])
    bbox_scale = np.array((w, h, w, h))
    d2l.show_bboxes(d2l.plt.imshow(img.asnumpy()).axes,
                    anchors[0] * bbox_scale)
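
To see why the midpoints are uniformly distributed, the following sketch computes relative anchor centers for a feature map of a given shape. It assumes that each center sits in the middle of its feature map unit (an offset of 0.5); the function name anchor_centers is hypothetical and the code does not call multibox_prior.

```python
def anchor_centers(fmap_h, fmap_w):
    """Relative (x, y) centers, one per feature map unit, row by row."""
    return [((col + 0.5) / fmap_w, (row + 0.5) / fmap_h)
            for row in range(fmap_h) for col in range(fmap_w)]


centers_2x2 = anchor_centers(2, 2)
centers_1x1 = anchor_centers(1, 1)
print(centers_2x2)
print(centers_1x1)
```

Because the coordinates are relative, the same centers can be scaled to an image of any size by multiplying by its width and height.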


We will first focus on the detection of small objects. In order to make it easier to distinguish upon 
display, the anchor boxes with different midpoints here do not overlap. We assume that the size 
of the anchor boxes is 0.15 and the height and width of the feature map are 4. We can see that the 
midpoints of anchor boxes from the 4 rows and 4 columns on the image are uniformly distributed. 


display_anchors(fmap_w=4, fmap_h=4, s=[0.15]) 

We are going to reduce the height and width of the feature map by half and use a larger anchor
box to detect larger objects. When the size is set to 0.4, overlaps will occur between regions of 
some anchor boxes. 


display_anchors(fmap_w=2, fmap_h=2, s=[0.4]) 

Finally, we are going to reduce the height and width of the feature map by half and increase the
anchor box size to 0.8. Now the midpoint of the anchor box is the center of the image.


display_anchors(fmap_w=1, fmap_h=1, s=[0.8])





Since we have generated anchor boxes of different sizes on multiple scales, we will use them to 
detect objects of various sizes at different scales. Now we are going to introduce a method based 
on convolutional neural networks (CNNs). 


At a certain scale, suppose we generate h x w sets of anchor boxes with different midpoints based
on c_i feature maps with the shape h x w and the number of anchor boxes in each set is a. For
example, for the first scale of the experiment, we generate 16 sets of anchor boxes with different
midpoints based on 10 (number of channels) feature maps with a shape of 4 x 4, and each set
contains 3 anchor boxes. Next, each anchor box is labeled with a category and offset based on
the classification and position of the ground-truth bounding box. At the current scale, the object
detection model needs to predict the category and offset of h x w sets of anchor boxes with different
midpoints based on the input image.


We assume that the c_i feature maps are the intermediate output of the CNN based on the input
image. Since each feature map has h x w different spatial positions, the same position will have c_i
units. According to the definition of receptive field in Section 6.2, the c_i units of the feature map
at the same spatial position have the same receptive field on the input image. Thus, they represent
the information of the input image in this same receptive field. Therefore, we can transform the c_i
units of the feature map at the same spatial position into the categories and offsets of the a anchor
boxes generated using that position as a midpoint. It is not hard to see that, in essence, we use the
information of the input image in a certain receptive field to predict the category and offset of the
anchor boxes close to the field on the input image.


When the feature maps of different layers have receptive fields of different sizes on the input im- 
age, they are used to detect objects of different sizes. For example, we can design a network to 
have a wider receptive field for each unit in the feature map that is closer to the output layer, to 
detect objects with larger sizes in the input image. 


We will implement a multiscale object detection model in the following section. 

Summary


• We can generate anchor boxes with different numbers and sizes on multiple scales to detect
objects of different sizes on multiple scales.

• The shape of the feature map can be used to determine the midpoint of the anchor boxes
that uniformly sample any image.

• We use the information for the input image from a certain receptive field to predict the
category and offset of the anchor boxes close to that field on the image.


Exercises 


1. Given an input image, assume 1 x c_i x h x w to be the shape of the feature map, where c_i, h,
and w are the number of channels, height, and width of the feature map. What methods can you
think of to convert this variable into the anchor box's category and offset? What is the shape of
the output?


Discussions


13.6 The Object Detection Dataset 


There are no small datasets, like MNIST or Fashion-MNIST, in the object detection field. In order 
to quickly test models, we are going to assemble a small dataset. First, we generate 1000 banana 
images of different angles and sizes using free bananas from our office. Then, we collect a series 
of background images and place a banana image at a random position on each image. 


13.6.1 Downloading the Dataset 


The banana detection dataset with all the images and csv label files can be downloaded directly 
from the Internet. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon, image, np, npx
import os
import pandas as pd

npx.set_np()

#@save
d2l.DATA_HUB['banana-detection'] = (d2l.DATA_URL + 'banana-detection.zip',
                                    '5bde26c8fce5ccdea9f91267273464dc968d20d72')

13.6.2 Reading the Dataset


We are going to read the object detection dataset in the read_data_bananas function. The dataset 
includes a csv file for target class labels and ground truth bounding box coordinates in the cor- 
ner format. We define BananasDataset to create the Dataset instance and finally define the 
load_data_bananas function to return the dataloaders. There is no need to read the test dataset in 
random order. 


#@save
def read_data_bananas(is_train=True):
    """Read the bananas dataset images and labels."""
    data_dir = d2l.download_extract('banana-detection')
    csv_fname = os.path.join(data_dir, 'bananas_train' if is_train
                             else 'bananas_val', 'label.csv')
    csv_data = pd.read_csv(csv_fname)
    csv_data = csv_data.set_index('img_name')
    images, targets = [], []
    for img_name, target in csv_data.iterrows():
        images.append(image.imread(
            os.path.join(data_dir, 'bananas_train' if is_train else
                         'bananas_val', 'images', f'{img_name}')))
        # Since all images have same object class i.e. category '0',
        # the `label` column corresponds to the only object i.e. banana
        # The target is as follows : (`label`, `xmin`, `ymin`, `xmax`, `ymax`)
        targets.append(list(target))
    return images, np.expand_dims(np.array(targets), 1) / 256


#@save
class BananasDataset(gluon.data.Dataset):
    def __init__(self, is_train):
        self.features, self.labels = read_data_bananas(is_train)
        print('read ' + str(len(self.features)) + (f' training examples' if
              is_train else f' validation examples'))

    def __getitem__(self, idx):
        return (self.features[idx].astype('float32').transpose(2, 0, 1),
                self.labels[idx])

    def __len__(self):
        return len(self.features)


#@save
def load_data_bananas(batch_size):
    """Load the bananas dataset."""
    train_iter = gluon.data.DataLoader(BananasDataset(is_train=True),
                                       batch_size, shuffle=True)
    val_iter = gluon.data.DataLoader(BananasDataset(is_train=False),
                                     batch_size)
    return (train_iter, val_iter)


Below, we read a minibatch and print the shape of the image and label. The shape of the image
is the same as in the previous experiment (batch size, number of channels, height, width). The
shape of the label is (batch size, m, 5), where m is equal to the maximum number of bounding
boxes contained in a single image in the dataset. Although computation for the minibatch is very
efficient, it requires each image to contain the same number of bounding boxes so that they can be
placed in the same batch. Since each image may have a different number of bounding boxes, we
can add illegal bounding boxes to images that have fewer than m bounding boxes until each image
contains m bounding boxes. Thus, we can read a minibatch of images each time. The label of each
bounding box in the image is represented by a tensor of length 5. The first element in the tensor is
the category of the object contained in the bounding box. When the value is -1, the bounding box
is an illegal bounding box for filling purposes. The remaining four elements of the array represent
the x, y axis coordinates of the upper-left corner of the bounding box and the x, y axis coordinates
of the lower-right corner of the bounding box (the value range is between 0 and 1). The banana
dataset here has only one bounding box per image, so m = 1.


batch_size, edge_size = 32, 256 

train_iter, _ = load_data_bananas(batch_size) 
batch = next(iter(train_iter)) 
batch[0].shape, batch[1].shape 


Downloading ../data/banana-detection.zip from http://d2l-data.s3-accelerate.amazonaws.com/banana-detection.zip...
read 1000 training examples
read 100 validation examples


((32, 3, 256, 256), (32, 1, 5)) 
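
The banana dataset never needs padding because m = 1, but the filling scheme described above can be sketched in a few lines of plain Python. The function pad_labels and the choice of zeros for the coordinates of a filler box are assumptions made for illustration; only the category value -1 is prescribed by the text.

```python
def pad_labels(per_image_boxes, fill_class=-1.0):
    """Pad every image's label list to the same number of boxes, m.

    Each box is [class, xmin, ymin, xmax, ymax]. Filler boxes get the
    category -1 so that later stages can recognize and ignore them.
    """
    m = max(len(boxes) for boxes in per_image_boxes)
    filler = [fill_class, 0.0, 0.0, 0.0, 0.0]
    return [list(boxes) + [list(filler) for _ in range(m - len(boxes))]
            for boxes in per_image_boxes]


labels = [
    [[0.0, 0.1, 0.1, 0.4, 0.5], [1.0, 0.5, 0.5, 0.9, 0.9]],  # two objects
    [[0.0, 0.2, 0.3, 0.6, 0.7]],                             # one object
]
padded = pad_labels(labels)
shape = (len(padded), len(padded[0]), len(padded[0][0]))
print(shape, padded[1][1])
```

After padding, the labels form a rectangular (batch size, m, 5) array that can be stacked into a single tensor.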


13.6.3 Demonstration 


We have ten images with bounding boxes on them. We can see that the angle, size, and position of
the banana differ in each image. Of course, this is a simple artificial dataset. In actual practice,
the data are usually much more complicated.


imgs = (batch[0][0:10].transpose(0, 2, 3, 1)) / 255
axes = d2l.show_images(imgs, 2, 5, scale=2)
for ax, label in zip(axes, batch[1][0:10]):
    d2l.show_bboxes(ax, [label[0][1:5] * edge_size], colors=['w'])

Summary


• The banana detection dataset we synthesized can be used to test object detection models.

• The data reading for object detection is similar to that for image classification. However,
after we introduce bounding boxes, the label shape and image augmentation (e.g., random
cropping) are changed.


Exercises 


1. Referring to the MXNet documentation, what are the parameters for the constructors of the
image.ImageDetIter and image.CreateDetAugmenter classes? What is their significance?


Discussions


13.7 Single Shot Multibox Detection (SSD) 


In the previous few sections, we have introduced bounding boxes, anchor boxes, multiscale ob- 
ject detection, and datasets. Now, we will use this background knowledge to construct an object 
detection model: single shot multibox detection (SSD) (Liu et al., 2016). This quick and easy model 
is already widely used. Some of the design concepts and implementation details of this model are 
also applicable to other object detection models. 


13.7.1 Model 


Fig. 13.7.1 shows the design of an SSD model. The model's main components are a base network
block and several multiscale feature blocks connected in a series. Here, the base network block is
used to extract features of original images, and it generally takes the form of a deep convolutional
neural network. The SSD paper uses a VGG network truncated before the classification
layer (Liu et al., 2016), but this is now commonly replaced by ResNet. We can design the base
network so that it outputs larger heights and widths. In this way, more anchor boxes are
generated based on this feature map, allowing us to detect smaller objects. Next, each multiscale
feature block reduces the height and width of the feature map provided by the previous layer (for
example, it may reduce the sizes by half). As a result, each element in the block's output feature
map has a larger receptive field on the input image. In this way, the closer a multiscale feature block
is to the top of Fig. 13.7.1, the smaller its output feature map, and the fewer the anchor boxes that
are generated based on the feature map. In addition, the closer a feature block is to the top, the
larger the receptive field of each element in the feature map and the better suited it is to detect
larger objects. Because SSD generates different numbers of anchor boxes of different sizes based
on the base network block and each multiscale feature block, and then predicts the categories and
offsets (i.e., predicted bounding boxes) of the anchor boxes in order to detect objects of different
sizes, SSD is a multiscale object detection model.

Fig. 13.7.1: The SSD is composed of a base network block and several multiscale feature blocks
connected in a series.


Next, we will describe the implementation of the modules in Fig. 13.7.1. First, we need to discuss 
the implementation of category prediction and bounding box prediction. 


Category Prediction Layer 


Set the number of object categories to q. In this case, the number of anchor box categories is
q + 1, with 0 indicating an anchor box that only contains background. For a certain scale, set
the height and width of the feature map to h and w, respectively. If we use each element as the
center to generate a anchor boxes, we need to classify a total of hwa anchor boxes. If we use a
fully connected layer for the output, this will likely result in an excessive number of model
parameters. Recall how we used convolutional layer channels to output category predictions in
Section 7.3. SSD uses the same method to reduce the model complexity.
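
A rough parameter count shows how large the gap can be. The sketch below ignores biases and assumes an input feature map with c channels; the function names and the example values (c = 8, h = w = 20, a = 5, q = 10) are chosen purely for illustration.

```python
def fc_weight_count(c, h, w, a, q):
    """Weights of a fully connected layer mapping the whole feature map
    (c * h * w inputs) to one score per anchor box and category."""
    return (c * h * w) * (h * w * a * (q + 1))


def conv_weight_count(c, a, q, kernel_size=3):
    """Weights of a convolutional layer with a * (q + 1) output channels."""
    return c * kernel_size * kernel_size * a * (q + 1)


fc = fc_weight_count(c=8, h=20, w=20, a=5, q=10)
conv = conv_weight_count(c=8, a=5, q=10)
print(fc, conv)
```

The convolutional layer shares its weights across all h x w positions, so its size does not depend on the spatial shape of the feature map.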


Specifically, the category prediction layer uses a convolutional layer that maintains the input
height and width. Thus, the output and input have a one-to-one correspondence to the spatial
coordinates along the width and height of the feature map. Assuming that the output and input
have the same spatial coordinates (x, y), the channel for the coordinates (x, y) on the output
feature map contains the category predictions for all anchor boxes generated using the input feature
map coordinates (x, y) as the center. Therefore, there are a(q + 1) output channels, with the
output channels indexed as i(q + 1) + j (0 ≤ j ≤ q) representing the predictions of the category index
j for the anchor box index i.
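
The channel layout can be written as a pair of small helper functions. The names channel_index and decode_channel are hypothetical and serve only to illustrate the indexing rule i(q + 1) + j.

```python
def channel_index(anchor_idx, category_idx, num_classes):
    """Output channel holding category j's score for anchor box i."""
    return anchor_idx * (num_classes + 1) + category_idx


def decode_channel(channel, num_classes):
    """Inverse mapping: channel -> (anchor box index, category index)."""
    return divmod(channel, num_classes + 1)


q, a = 10, 5                            # 10 object categories, 5 anchors
last = channel_index(a - 1, q, q)       # last channel overall
example = channel_index(2, 3, q)        # anchor 2, category 3
print(last, example, decode_channel(example, q))
```

With a = 5 and q = 10, the channels run from 0 to 54, which matches the 5 x (10 + 1) = 55 output channels of the layer.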


Now, we will define a category prediction layer of this type. After we specify the parameters a and 
q, it uses a 3 x 3 convolutional layer with a padding of 1. The heights and widths of the input and 
output of this convolutional layer remain unchanged. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, gluon, image, init, np, npx
from mxnet.gluon import nn

npx.set_np()


def cls_predictor(num_anchors, num_classes):
    return nn.Conv2D(num_anchors * (num_classes + 1), kernel_size=3,
                     padding=1)


Bounding Box Prediction Layer 


The design of the bounding box prediction layer is similar to that of the category prediction layer.
The only difference is that, here, we need to predict 4 offsets for each anchor box, rather than q + 1
categories.


def bbox_predictor(num_anchors):
    return nn.Conv2D(num_anchors * 4, kernel_size=3, padding=1)


Concatenating Predictions for Multiple Scales 


As we mentioned, SSD uses feature maps based on multiple scales to generate anchor boxes and 
predict their categories and offsets. Because the shapes and number of anchor boxes centered on 
the same element differ for the feature maps of different scales, the prediction outputs at different 
scales may have different shapes. 


In the following example, we use the same batch of data to construct feature maps of two different
scales, Y1 and Y2. Here, Y2 has half the height and half the width of Y1. Using category prediction
as an example, we assume that each element in the Y1 and Y2 feature maps generates five (Y1) or
three (Y2) anchor boxes. When there are 10 object categories, the number of category prediction
output channels is either 5 x (10 + 1) = 55 or 3 x (10 + 1) = 33. The format of the prediction
output is (batch size, number of channels, height, width). As you can see, except for the batch
size, the sizes of the other dimensions are different. Therefore, we must transform them into a
consistent format and concatenate the predictions of the multiple scales to facilitate subsequent
computation.


def forward(x, block):
    block.initialize()
    return block(x)


Y1 = forward(np.zeros((2, 8, 20, 20)), cls_predictor(5, 10)) 
Y2 = forward(np.zeros((2, 16, 10, 10)), cls_predictor(3, 10)) 
(Y1.shape, Y2.shape) 


((2, 55, 20, 20), (2, 33, 10, 10)) 


The channel dimension contains the predictions for all anchor boxes with the same center. We
first move the channel dimension to the final dimension. Because the batch size is the same for
all scales, we can convert the prediction results to a 2D format (batch size, height x width x
number of channels) to facilitate subsequent concatenation on the 1st dimension.

def flatten_pred(pred):
    return npx.batch_flatten(pred.transpose(0, 2, 3, 1))


def concat_preds(preds):
    return np.concatenate([flatten_pred(p) for p in preds], axis=1)


Thus, regardless of the different shapes of Y1 and Y2, we can still concatenate the prediction results
for the two different scales of the same batch.


concat_preds([Y1, Y2]).shape 


(2, 25300) 
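
The shape (2, 25300) can be verified with shape arithmetic alone. The sketch below also shows where a given prediction lands after the channel dimension is moved last and the result is flattened; the helper names are assumptions made for illustration and do not call MXNet.

```python
def flattened_length(shape):
    """Length per example after (B, C, H, W) -> (B, H * W * C)."""
    _, c, h, w = shape
    return h * w * c


def concat_shape(shapes):
    """Shape after flattening each scale and concatenating on axis 1."""
    batch = shapes[0][0]
    assert all(s[0] == batch for s in shapes), 'batch sizes must match'
    return (batch, sum(flattened_length(s) for s in shapes))


def flat_index(row, col, channel, shape):
    """Position of one prediction inside the flattened (H, W, C) layout."""
    _, c, _, w = shape
    return (row * w + col) * c + channel


y1_shape, y2_shape = (2, 55, 20, 20), (2, 33, 10, 10)
result = concat_shape([y1_shape, y2_shape])
print(result, flat_index(0, 1, 0, y1_shape))
```

Because the channel dimension comes last, all 55 predictions belonging to one spatial position of Y1 occupy a contiguous run in the flattened vector.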


Height and Width Downsample Block 


For multiscale object detection, we define the following down_sample_blk block, which reduces
the height and width by 50%. This block consists of two 3 x 3 convolutional layers with a padding
of 1 and a 2 x 2 maximum pooling layer with a stride of 2 connected in a series. As we know, 3 x 3
convolutional layers with a padding of 1 do not change the shape of feature maps. However, the
subsequent pooling layer directly reduces the size of the feature map by half. Because 1 x 2 +
(3 - 1) + (3 - 1) = 6, each element in the output feature map has a receptive field on the input
feature map of the shape 6 x 6. As you can see, the height and width downsample block enlarges
the receptive field of each element in the output feature map.


def down_sample_blk(num_channels):
    blk = nn.Sequential()
    for _ in range(2):
        blk.add(nn.Conv2D(num_channels, kernel_size=3, padding=1),
                nn.BatchNorm(in_channels=num_channels),
                nn.Activation('relu'))
    blk.add(nn.MaxPool2D(2))
    return blk


By testing forward computation in the height and width downsample block, we can see that it 
changes the number of input channels and halves the height and width. 


forward(np.zeros((2, 3, 20, 20)), down_sample_blk(10)).shape


(2, 10, 10, 10) 
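
The 6 x 6 receptive field of the downsample block can be checked by walking through its layers from output to input. The sketch below uses the standard recurrence r_in = (r_out - 1) x stride + kernel; the function name and the (kernel, stride) list encoding are assumptions made for illustration.

```python
def receptive_field(layers):
    """Receptive field of one output element, for a stack of layers.

    `layers` lists (kernel_size, stride) pairs from input to output.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Two 3 x 3 convolutions (stride 1) followed by 2 x 2 max-pooling (stride 2)
down_sample_layers = [(3, 1), (3, 1), (2, 2)]
one_block = receptive_field(down_sample_layers)
three_blocks = receptive_field(down_sample_layers * 3)
print(one_block, three_blocks)
```

Stacking several such blocks grows the receptive field quickly, which is why deeper feature maps are better suited to larger objects.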

Base Network Block


The base network block is used to extract features from original images. To simplify the compu- 
tation, we will construct a small base network. This network consists of three height and width 
downsample blocks connected in a series, so it doubles the number of channels at each step. 
When we input an original image with the shape 256 x 256, the base network block outputs a 
feature map with the shape 32 x 32. 


def base_net():
    blk = nn.Sequential()
    for num_filters in [16, 32, 64]:
        blk.add(down_sample_blk(num_filters))
    return blk


forward(np.zeros((2, 3, 256, 256)), base_net()).shape 


(2, 64, 32, 32)


The Complete Model 


The SSD model contains a total of five modules. Each module outputs a feature map used to gen- 
erate anchor boxes and predict the categories and offsets of these anchor boxes. The first module 
is the base network block, modules two to four are height and width downsample blocks, and the 
fifth module is a global maximum pooling layer that reduces the height and width to 1. Therefore, 
modules two to five are all multiscale feature blocks shown in Fig. 13.7.1. 


def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 4:
        blk = nn.GlobalMaxPool2D()
    else:
        blk = down_sample_blk(128)
    return blk


Now, we will define the forward computation process for each module. In contrast to the 
previously-described convolutional neural networks, this module not only returns feature map 
Y output by convolutional computation, but also the anchor boxes of the current scale generated 
from Y and their predicted categories and offsets. 


def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)


As we mentioned, the closer a multiscale feature block is to the top in Fig. 13.7.1, the larger the
objects it detects and the larger the anchor boxes it must generate. Here, we first divide the interval
from 0.2 to 1.05 into five equal parts to determine the sizes of smaller anchor boxes at different
scales: 0.2, 0.37, 0.54, etc. Then, according to √(0.2 x 0.37) = 0.272, √(0.37 x 0.54) = 0.447, and
similar formulas, we determine the sizes of larger anchor boxes at the different scales.

sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
         [0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1
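
The hard-coded values in sizes can be reproduced from the rule described above. The following sketch assumes rounding to three decimal places; the function name ssd_sizes and its parameters are hypothetical.

```python
import math


def ssd_sizes(s_min=0.2, s_max=1.05, num_scales=5):
    """Smaller and larger anchor box sizes for each scale.

    The smaller size steps evenly from s_min toward s_max; the larger size
    is the geometric mean of this scale's and the next scale's smaller size.
    """
    step = (s_max - s_min) / num_scales
    small = [s_min + k * step for k in range(num_scales + 1)]
    return [[round(small[k], 3),
             round(math.sqrt(small[k] * small[k + 1]), 3)]
            for k in range(num_scales)]


computed_sizes = ssd_sizes()
print(computed_sizes)
```

The upper end of the interval, 1.05, is used only inside the last geometric mean, so no anchor box is actually generated with a size above 1.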


Now, we can define the complete model, TinySSD. 


class TinySSD(nn.Block):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        for i in range(5):
            # The assignment statement is self.blk_i = get_blk(i)
            setattr(self, f'blk_{i}', get_blk(i))
            setattr(self, f'cls_{i}', cls_predictor(num_anchors,
                                                    num_classes))
            setattr(self, f'bbox_{i}', bbox_predictor(num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        for i in range(5):
            # getattr(self, 'blk_%d' % i) accesses self.blk_i
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
        anchors = np.concatenate(anchors, axis=1)
        cls_preds = concat_preds(cls_preds)
        # Reshape to (batch size, number of anchor boxes, num_classes + 1);
        # the batch size is unchanged and -1 infers the number of anchor boxes
        cls_preds = cls_preds.reshape(
            cls_preds.shape[0], -1, self.num_classes + 1)
        bbox_preds = concat_preds(bbox_preds)
        return anchors, cls_preds, bbox_preds


We now create an SSD model instance and use it to perform forward computation on image
minibatch X, which has a height and width of 256 pixels. As we verified previously, the first module
outputs a feature map with the shape 32 x 32. Because modules two to four are height and width
downsample blocks, module five is a global pooling layer, and each element in the feature map is
used as the center for 4 anchor boxes, a total of (32² + 16² + 8² + 4² + 1) x 4 = 5444 anchor boxes
are generated for each image at the five scales.


net = TinySSD(num_classes=1) 
net.initialize() 

X = np.zeros((32, 3, 256, 256)) 
anchors, cls_preds, bbox_preds = net(X) 


print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)


output anchors: (1, 5444, 4) 
output class preds: (32, 5444, 2) 
output bbox preds: (32, 21776) 
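
These output shapes follow directly from the feature map sizes of the five modules. A short check in plain Python (the function name total_anchors is an assumption made for illustration):

```python
def total_anchors(fmap_sides, anchors_per_unit):
    """Anchor boxes generated over square feature maps of the given sides."""
    return sum(side * side for side in fmap_sides) * anchors_per_unit


fmap_sides = [32, 16, 8, 4, 1]       # outputs of the five modules
anchors_per_unit = 2 + 3 - 1         # len(sizes[0]) + len(ratios[0]) - 1
num_classes = 1

n = total_anchors(fmap_sides, anchors_per_unit)
cls_shape = (32, n, num_classes + 1)  # one score per category and background
bbox_shape = (32, n * 4)              # four offsets per anchor box
print(n, cls_shape, bbox_shape)
```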

13.7.2 Training


Now, we will explain, step by step, how to train the SSD model for object detection. 


Data Reading and Initialization 


We read the banana detection dataset we created in the previous section. 


batch_size = 32 
train_iter, _ = d2l.load_data_bananas(batch_size)


read 1000 training examples 
read 100 validation examples 


There is 1 category in the banana detection dataset. After defining the module, we need to initial- 
ize the model parameters and define the optimization algorithm. 


device, net = d2l.try_gpu(), TinySSD(num_classes=1)
net.initialize(init=init.Xavier(), ctx=device)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.2, 'wd': 5e-4})


Defining Loss and Evaluation Functions 


Object detection involves two types of losses. The first is anchor box category loss. For this, we
can simply reuse the cross-entropy loss function we used in image classification. The second loss
is positive anchor box offset loss. Offset prediction is a regression problem. However, here,
we do not use the squared loss introduced previously. Rather, we use the L1 norm loss, which
is the absolute value of the difference between the predicted value and the ground-truth value.
The mask variable bbox_masks removes negative anchor boxes and padding anchor boxes from
the loss calculation. Finally, we add the anchor box category and offset losses to find the final loss
function for the model.


cls_loss = gluon.loss.SoftmaxCrossEntropyLoss()
bbox_loss = gluon.loss.L1Loss()


def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    cls = cls_loss(cls_preds, cls_labels)
    bbox = bbox_loss(bbox_preds * bbox_masks, bbox_labels * bbox_masks)
    return cls + bbox
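
The effect of the mask can be illustrated on a single example without MXNet. The sketch below assumes that the loss is averaged over all offset entries of the example, masked or not; the function masked_l1 and the toy numbers are assumptions made for illustration.

```python
def masked_l1(preds, labels, masks):
    """Mean absolute error over all entries, with masked entries zeroed.

    Multiplying both predictions and labels by the mask makes entries of
    negative or padding anchor boxes contribute exactly zero to the loss.
    """
    total = sum(abs(p * m - t * m) for p, t, m in zip(preds, labels, masks))
    return total / len(preds)


# Two anchor boxes with four offsets each: the first is positive (mask 1),
# the second is negative (mask 0), so its wild predictions are ignored.
preds = [0.5, -0.2, 0.1, 0.3, 9.0, 9.0, 9.0, 9.0]
labels = [0.4, 0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0]
masks = [1, 1, 1, 1, 0, 0, 0, 0]
loss = masked_l1(preds, labels, masks)
unmasked = masked_l1(preds, labels, [1] * 8)
print(loss, unmasked)
```

Without the mask, the loss would be dominated by anchor boxes that have no ground-truth bounding box to regress toward.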


We can use the accuracy rate to evaluate the classification results. As we use the L1 norm loss, we
will use the average absolute error to evaluate the bounding box prediction results.


def cls_eval(cls_preds, cls_labels):
    # Because the category prediction results are placed in the final
    # dimension, argmax must specify this dimension
    return float((cls_preds.argmax(axis=-1).astype(
        cls_labels.dtype) == cls_labels).sum())


def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((np.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())


Training the Model 


During model training, we must generate multiscale anchor boxes (anchors) in the model's
forward computation process and predict the category (cls_preds) and offset (bbox_preds) for each
anchor box. Afterwards, we label the category (cls_labels) and offset (bbox_labels) of each
generated anchor box based on the label information Y. Finally, we calculate the loss function using
the predicted and labeled category and offset values. To simplify the code, we do not evaluate the
training dataset here.


num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
for epoch in range(num_epochs):
    # Sum of correct class predictions, no. of class labels,
    # sum of absolute bbox errors, no. of bbox labels
    metric = d2l.Accumulator(4)
    for features, target in train_iter:
        timer.start()
        X = features.as_in_ctx(device)
        Y = target.as_in_ctx(device)
        with autograd.record():
            # Generate multiscale anchor boxes and predict the category and
            # offset of each
            anchors, cls_preds, bbox_preds = net(X)
            # Label the category and offset of each anchor box
            bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors,
                                                                      Y)
            # Calculate the loss function using the predicted and labeled
            # category and offset values
            l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                          bbox_masks)
        l.backward()
        trainer.step(batch_size)
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.size,
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.size)
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter._dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')


class err 3.53e-03, bbox mae 3.81e-03 
2709.1 examples/sec on gpu(0)


13.7.3 Prediction


In the prediction stage, we want to detect all objects of interest in the image. Below, we read the 
test image and transform its size. Then, we convert it to the four-dimensional format required by 
the convolutional layer. 


img = image.imread('../img/banana.jpg') 
feature = image.imresize(img, 256, 256).astype('float32') 
X = np.expand_dims(feature.transpose(2, 0, 1), axis=0) 


Using the multibox_detection function, we predict the bounding boxes based on the anchor boxes 
and their predicted offsets. Then, we use non-maximum suppression to remove similar bounding 
boxes. 


def predict(X):
    anchors, cls_preds, bbox_preds = net(X.as_in_ctx(device))
    cls_probs = npx.softmax(cls_preds).transpose(0, 2, 1)
    output = d2l.multibox_detection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
    return output[0, idx]


output = predict(X) 


Finally, we take all the bounding boxes with a confidence level of at least 0.9 and display them as 
the final output. 


def display(img, output, threshold):
    d2l.set_figsize((5, 5))
    fig = d2l.plt.imshow(img.asnumpy())
    for row in output:
        score = float(row[1])
        if score < threshold:
            continue
        h, w = img.shape[0:2]
        bbox = [row[2:6] * np.array((w, h, w, h), ctx=row.ctx)]
        d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')


display(img, output, threshold=0.9)


Summary


• SSD is a multiscale object detection model. This model generates different numbers of
anchor boxes of different sizes based on the base network block and each multiscale feature
block and predicts the categories and offsets of the anchor boxes to detect objects of different
sizes.


• During SSD model training, the loss function is calculated using the predicted and labeled
category and offset values.


Exercises 


1. Due to space limitations, we have ignored some of the implementation details of the SSD 
model in this experiment. Can you further improve the model in the following areas? 


Loss Function 


A. For the predicted offsets, replace the L1 norm loss with the smooth L1 loss. This loss function
uses a square function around zero for greater smoothness. The width of this smoothed region is
controlled by the hyperparameter σ:


$$f(x) = \begin{cases} (\sigma x)^2/2, & \text{if } |x| < 1/\sigma^2 \\ |x| - 0.5/\sigma^2, & \text{otherwise} \end{cases} \qquad (13.7.1)$$


When σ is large, this loss is similar to the L1 norm loss. When σ is small, the loss function
is smoother.


sigmas = [10, 1, 0.5]
lines = ['-', '--', '-.']
x = np.arange(-2, 2, 0.1)
d2l.set_figsize()

for l, s in zip(lines, sigmas):
    y = npx.smooth_l1(x, scalar=s)
    d2l.plt.plot(x.asnumpy(), y.asnumpy(), l, label='sigma=%.1f' % s)
d2l.plt.legend();


[Figure: smooth L1 loss for sigma=10.0 (solid line), sigma=1.0 (dashed line), and sigma=0.5 (dash-dot line).]
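
For readers who want to check individual values by hand, here is a minimal scalar sketch of a smooth L1 loss in plain Python. It assumes the piecewise form with a quadratic region of half-width 1/σ² around zero; the function name smooth_l1 and the sample inputs are illustrative and are not the MXNet operator itself.

```python
def smooth_l1(x, sigma=1.0):
    """Scalar smooth L1 loss: quadratic near zero, linear elsewhere."""
    threshold = 1.0 / sigma ** 2
    if abs(x) < threshold:
        return (sigma * x) ** 2 / 2
    return abs(x) - 0.5 / sigma ** 2


for sigma in (10, 1, 0.5):
    # Larger sigma shrinks the quadratic region, so the loss approaches |x|.
    print(sigma, [round(smooth_l1(x, sigma), 4) for x in (-2.0, -0.5, 0.0, 0.5, 2.0)])
```

The two branches meet at |x| = 1/σ², where both evaluate to 0.5/σ², so the function is continuous.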


In the experiment, we used cross-entropy loss for category prediction. Now, assume that the
prediction probability of the actual category $j$ is $p_j$ and the cross-entropy loss is $-\log p_j$. We can
also use the focal loss (Lin et al., 2017a). Given the positive hyperparameters $\gamma$ and $\alpha$, this loss is
defined as:


$$-\alpha (1 - p_j)^{\gamma} \log p_j. \qquad (13.7.2)$$


As you can see, by increasing $\gamma$, we can effectively reduce the loss when the probability of
predicting the correct category is high.


def focal_loss(gamma, x):
    return -(1 - x) ** gamma * np.log(x)

x = np.arange(0.01, 1, 0.01)
for l, gamma in zip(lines, [0, 1, 5]):
    y = d2l.plt.plot(x.asnumpy(), focal_loss(gamma, x).asnumpy(), l,
                     label='gamma=%.1f' % gamma)
d2l.plt.legend();


[Figure: focal loss for gamma=0.0 (solid line), gamma=1.0 (dashed line), and gamma=5.0 (dash-dot line).]
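
As a quick numeric check of the focal loss formula, the following plain-Python sketch evaluates it for a confident and an unconfident prediction. The helper name focal_loss_scalar, the default values, and the sample probabilities are assumptions made for illustration; unlike the plotting code above, it also includes the weight α.

```python
import math


def focal_loss_scalar(p, gamma=2.0, alpha=1.0):
    """Focal loss for the probability p assigned to the true category."""
    return -alpha * (1 - p) ** gamma * math.log(p)


for p in (0.9, 0.1):
    ce = focal_loss_scalar(p, gamma=0.0)  # gamma = 0 gives cross-entropy
    fl = focal_loss_scalar(p, gamma=2.0)
    # The ratio fl / ce equals (1 - p) ** gamma: easy examples (p = 0.9)
    # are down-weighted far more than hard ones (p = 0.1).
    print(p, round(ce, 4), round(fl, 6), round(fl / ce, 4))
```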





Training and Prediction 


B. When an object is relatively small compared to the image, the model normally adopts a larger
input image size.


C. Labeling anchor box categories generally produces a large number of negative anchor boxes.
We can sample the negative anchor boxes to better balance the data categories. To do
this, we can define a negative_mining_ratio parameter in the multibox_target function.


D. Assign hyperparameters with different weights to the anchor box category loss and positive 
anchor box offset loss in the loss function. 


E. Refer to the SSD paper. What methods can be used to evaluate the precision of object detection 
models (Liu et al., 2016)? 


Discussions¹⁸⁴


13.8 Region-based CNNs (R-CNNs) 


Region-based convolutional neural networks or regions with CNN features (R-CNNs) are a
pioneering approach that applies deep models to object detection (Girshick et al., 2014). In this
section, we will discuss R-CNNs and a series of improvements made to them: Fast R-CNN (Girshick,
2015), Faster R-CNN (Ren et al., 2015), and Mask R-CNN (He et al., 2017a). Due to space limitations,
we will confine our discussion to the designs of these models.


13.8.1 R-CNNs 


R-CNN models first select several proposed regions from an image (for example, anchor boxes are
one type of selection method) and then label their categories and bounding boxes (e.g., offsets).
Then, they use a CNN to perform forward computation to extract features from each proposed
region. Afterwards, they use the features of each proposed region to predict its category and
bounding box. Fig. 13.8.1 shows an R-CNN model.





184 https://discuss.d2l.ai/t/373





Fig. 13.8.1: R-CNN model. 


Specifically, R-CNNs are composed of four main parts:


1. Selective search is performed on the input image to select multiple high-quality proposed
regions (Uijlings et al., 2013). These proposed regions are generally selected on multiple
scales and have different shapes and sizes. The category and ground-truth bounding box of
each proposed region are labeled.


2. A pre-trained CNN is selected and truncated before the output layer. Each proposed region
is resized to the input dimensions required by the network, and forward computation
outputs the features extracted from the proposed region.


3. The features and labeled category of each proposed region are combined as an example to
train multiple support vector machines for object classification. Here, each support vector
machine is used to determine whether an example belongs to a certain category.


4. The features and labeled bounding box of each proposed region are combined as an example
to train a linear regression model for ground-truth bounding box prediction.


Although R-CNN models use pre-trained CNNs to effectively extract image features, the main 
downside is the slow speed. As you can imagine, we can select thousands of proposed regions 
from a single image, requiring thousands of forward computations from the CNN to perform ob- 
ject detection. This massive computing load means that R-CNNs are not widely used in actual 
applications. 


13.8.2 Fast R-CNN 


The main performance bottleneck of an R-CNN model is the need to independently extract
features for each proposed region. As these regions have a high degree of overlap, independent
feature extraction results in a high volume of repetitive computations. Fast R-CNN improves on
the R-CNN by only performing CNN forward computation on the image as a whole.


Fig. 13.8.2: Fast R-CNN model.


Fig. 13.8.2 shows a Fast R-CNN model. Its primary computation steps are described below: 


1. Compared to an R-CNN model, a Fast R-CNN model uses the entire image as the CNN input
for feature extraction, rather than each proposed region. Moreover, this network is
generally trained to update the model parameters. As the input is an entire image, the CNN output
shape is $1 \times c \times h_1 \times w_1$.


2. Assuming selective search generates $n$ proposed regions, their different shapes indicate
regions of interest (RoIs) of different shapes on the CNN output. Features of the same shape
must be extracted from these RoIs (here we assume that the height is $h_2$ and the width is
$w_2$). Fast R-CNN introduces RoI pooling, which uses the CNN output and RoIs as input to
output a concatenation of the features extracted from each proposed region with the shape
$n \times c \times h_2 \times w_2$.


3. A fully connected layer is used to transform the output shape to $n \times d$, where $d$ is determined
by the model design.


4. During category prediction, the shape of the fully connected layer output is again
transformed to $n \times q$ and we use softmax regression ($q$ is the number of categories). During
bounding box prediction, the shape of the fully connected layer output is again transformed
to $n \times 4$. This means that we predict the category and bounding box for each proposed region.


The RoI pooling layer in Fast R-CNN is somewhat different from the pooling layers we have
discussed before. In a normal pooling layer, we set the pooling window, padding, and stride to
control the output shape. In an RoI pooling layer, we can directly specify the output shape of each
region, such as specifying the height and width of each region as $h_2, w_2$. Assuming that the height
and width of the RoI window are $h$ and $w$, this window is divided into a grid of sub-windows with
the shape $h_2 \times w_2$. The size of each sub-window is about $(h/h_2) \times (w/w_2)$. The sub-window height
and width must always be integers and the largest element is used as the output for a given
sub-window. This allows the RoI pooling layer to extract features of the same shape from RoIs of
different shapes.
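
The sub-window splitting described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name roi_max_pool is hypothetical, the RoI is given as end-exclusive row and column ranges on the feature map, and sub-window boundaries are rounded up to integers, which is one possible rounding rule and not necessarily the one a particular library uses.

```python
import math


def roi_max_pool(feature, roi, pooled_size):
    """Max-pool one RoI of a 2D feature map (list of lists).

    roi = (row_start, row_end, col_start, col_end), end-exclusive.
    Assumes the RoI is at least as large as pooled_size in each dimension.
    """
    r0, r1, c0, c1 = roi
    h, w = r1 - r0, c1 - c0
    h2, w2 = pooled_size
    # Integer sub-window boundaries (assumption: round boundaries up).
    row_edges = [r0 + math.ceil(i * h / h2) for i in range(h2 + 1)]
    col_edges = [c0 + math.ceil(j * w / w2) for j in range(w2 + 1)]
    out = []
    for i in range(h2):
        out_row = []
        for j in range(w2):
            window = [feature[r][c]
                      for r in range(row_edges[i], row_edges[i + 1])
                      for c in range(col_edges[j], col_edges[j + 1])]
            out_row.append(max(window))
        out.append(out_row)
    return out


# A 4 x 4 single-channel feature map holding the values 0, 1, ..., 15.
X = [[4 * r + c for c in range(4)] for r in range(4)]
pooled_a = roi_max_pool(X, (0, 3, 0, 3), (2, 2))  # 3 x 3 RoI, top-left
pooled_b = roi_max_pool(X, (1, 4, 0, 4), (2, 2))  # 3 x 4 RoI, rows 1 to 3
print(pooled_a, pooled_b)
```

Both RoIs have different shapes, yet each produces a 2 × 2 output, which is the point of the layer.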


In Fig. 13.8.3, we select a 3 × 3 region as an RoI of the 4 × 4 input. For this RoI, we use a 2 × 2 RoI
pooling layer to obtain a single 2 × 2 output. When we divide the region into four sub-windows,
they respectively contain the elements 0, 1, 4, and 5 (5 is the largest); 2 and 6 (6 is the largest); 8
and 9 (9 is the largest); and 10.


Fig. 13.8.3: 2 × 2 RoI pooling layer.










We use the ROIPooling function to demonstrate the RoI pooling layer computation. Assume that
the CNN extracts the feature X with both a height and width of 4 and only a single channel.


from mxnet import np, npx 
npx.set_np() 


X = np.arange(16).reshape(1, 1, 4, 4) 
X 


array([[[[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.],
         [12., 13., 14., 15.]]]])


Assume that the height and width of the image are both 40 pixels and that selective search
generates two proposed regions on the image. Each region is expressed as five elements: the index
of the image in the batch that the region belongs to (0 here, since X holds a single image), followed
by the x, y coordinates of the region's upper-left and bottom-right corners.


rois = np.array([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]]) 


Because the height and width of X are 1/10 of the height and width of the image, the coordinates
of the two proposed regions are multiplied by 0.1 according to the spatial_scale, and then the
RoIs are labeled on X as X[:, :, 0:3, 0:3] and X[:, :, 1:4, 0:4], respectively. Finally, we
divide the two RoIs into a sub-window grid and extract features with a height and width of 2.


npx.roi_pooling(X, rois, pooled_size=(2, 2), spatial_scale=0.1) 


array([[[[ 5.,  6.],
         [ 9., 10.]]],


       [[[ 9., 11.],
         [13., 15.]]]])


13.8.3 Faster R-CNN


In order to obtain precise object detection results, Fast R-CNN generally requires that many
proposed regions be generated in selective search. Faster R-CNN replaces selective search with a
region proposal network. This reduces the number of proposed regions generated, while ensuring
precise object detection.


Fig. 13.8.4: Faster R-CNN model. 


Fig. 13.8.4 shows a Faster R-CNN model. Compared to Fast R-CNN, Faster R-CNN only changes the
method for generating proposed regions from selective search to a region proposal network. The
other parts of the model remain unchanged. The detailed region proposal network computation
process is described below:


1. We use a 3 x 3 convolutional layer with a padding of 1 to transform the CNN output and set 
the number of output channels to c. This way, each element in the feature map the CNN 
extracts from the image is a new feature with a length of c. 


2. We use each element in the feature map as a center to generate multiple anchor boxes of 
different sizes and aspect ratios and then label them. 


3. For each anchor box, we use the length-c feature of the element at its center to predict
the binary category (object or background) and the bounding box of that anchor
box.


4. Then, we use non-maximum suppression to remove similar bounding box results that
correspond to category predictions of “object”. Finally, we output the predicted bounding boxes
as the proposed regions required by the RoI pooling layer.
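
The non-maximum suppression in step 4 can be sketched as a greedy procedure in plain Python. This is an illustrative sketch rather than the region proposal network's actual implementation: the function names iou and nms, the corner format (x1, y1, x2, y2), and the 0.5 threshold are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept by greedy non-maximum suppression."""
    # Visit boxes from the highest to the lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box
        # that has already been kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0.10, 0.10, 0.50, 0.50),   # high-scoring box
         (0.12, 0.12, 0.52, 0.52),   # near-duplicate of the first box
         (0.60, 0.60, 0.90, 0.90)]   # separate object
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

Here the second box overlaps the first with an IoU of about 0.82 and is suppressed, while the third box does not overlap either and is kept.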


It is worth noting that, as a part of the Faster R-CNN model, the region proposal network is trained
together with the rest of the model. In addition, the Faster R-CNN objective function includes the
category and bounding box predictions in object detection, as well as the binary category and
bounding box predictions for the anchor boxes in the region proposal network. As a result, the region
proposal network can learn how to generate high-quality proposed regions, which reduces the
number of proposed regions while maintaining the precision of object detection.







13.8.4 Mask R-CNN 


If training data is labeled with the pixel-level positions of each object in an image, a Mask R-CNN 
model can effectively use these detailed labels to further improve the precision of object detection. 




Fig. 13.8.5: Mask R-CNN model. 


As shown in Fig. 13.8.5, Mask R-CNN is a modification of the Faster R-CNN model. Mask R-CNN
replaces the RoI pooling layer with an RoI alignment layer, which uses bilinear
interpolation to retain spatial information on feature maps, making Mask R-CNN better suited
for pixel-level predictions. The RoI alignment layer outputs feature maps of the same shape for
all RoIs. These feature maps are used not only to predict the categories and bounding boxes of
the RoIs, but also, through an additional fully convolutional network, to predict the pixel-level
positions of objects. We will describe how to use fully convolutional networks to predict
pixel-level semantics in images later in this chapter.
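
To illustrate the bilinear interpolation that an RoI alignment layer relies on, the following plain-Python sketch samples a feature map at a fractional position. It shows only the interpolation step, not a full RoI alignment layer; the function name bilinear_sample and the toy feature map are assumptions for the example.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map at the real-valued (y, x).

    Assumes 0 <= y <= height - 1 and 0 <= x <= width - 1.
    """
    h, w = len(feature), len(feature[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - dx) * feature[y0][x0] + dx * feature[y0][x1]
    bottom = (1 - dx) * feature[y1][x0] + dx * feature[y1][x1]
    return (1 - dy) * top + dy * bottom


X = [[4 * r + c for c in range(4)] for r in range(4)]
center = bilinear_sample(X, 0.5, 0.5)     # midpoint of elements 0, 1, 4, 5
between = bilinear_sample(X, 1.25, 2.0)   # between elements 6 and 10
print(center, between)
```

Because sampling positions need not be rounded to integer coordinates, no spatial information is lost to quantization, in contrast to the integer sub-windows of RoI pooling.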


Summary 


• An R-CNN model selects several proposed regions and uses a CNN to perform forward
computation and extract the features from each proposed region. It then uses these features to
predict the categories and bounding boxes of proposed regions.


• Fast R-CNN improves on the R-CNN by only performing CNN forward computation on the
image as a whole. It introduces an RoI pooling layer to extract features of the same shape
from RoIs of different shapes.


• Faster R-CNN replaces the selective search used in Fast R-CNN with a region proposal
network. This reduces the number of proposed regions generated, while ensuring precise
object detection.


• Mask R-CNN uses the same basic structure as Faster R-CNN, but adds a fully convolutional
network to help locate objects at the pixel level and further improve the precision of object
detection.


Exercises


1. Study the implementation of each model in the GluonCV toolkit¹⁸⁵ related to this section.


Discussions¹⁸⁶


13.9 Semantic Segmentation and the Dataset 


In our discussion of object detection issues in the previous sections, we only used rectangular
bounding boxes to label and predict objects in images. In this section, we will look at
semantic segmentation, which attempts to segment images into regions with different semantic
categories. These semantic regions label and predict objects at the pixel level. Fig. 13.9.1 shows a
semantically-segmented image, with areas labeled “dog”, “cat”, and “background”. As you can
see, compared to object detection, semantic segmentation labels areas with pixel-level borders,
for significantly greater precision.




Fig. 13.9.1: Semantically-segmented image, with areas labeled “dog”, “cat”, and “background”. 


13.9.1 Image Segmentation and Instance Segmentation 


In the computer vision field, there are two important methods related to semantic segmentation: 
image segmentation and instance segmentation. Here, we will distinguish these concepts from 
semantic segmentation as follows: 


• Image segmentation divides an image into several constituent regions. This method
generally uses the correlations between pixels in an image. During training, labels are not
needed for image pixels. However, during prediction, this method cannot ensure that the
segmented regions have the semantics we want. If we input the image in Fig. 13.9.1, image
segmentation might divide the dog into two regions, one covering the dog's mouth and eyes
where black is the prominent color and the other covering the rest of the dog where yellow
is the prominent color.


• Instance segmentation is also called simultaneous detection and segmentation. This
method attempts to identify the pixel-level regions of each object instance in an image. In
contrast to semantic segmentation, instance segmentation not only distinguishes semantics,
but also different object instances. If an image contains two dogs, instance segmentation
will distinguish which pixels belong to which dog.





185 https://github.com/dmlc/gluon-cv/
186 https://discuss.d2l.ai/t/374


13.9.2 The Pascal VOC2012 Semantic Segmentation Dataset


In the semantic segmentation field, one important dataset is Pascal VOC2012¹⁸⁷. To better
understand this dataset, we must first import the packages and modules needed for the experiment.


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon, image, np, npx
import os

npx.set_np()


The original site might be unstable, so we download the data from a mirror site. The archive is 
about 2 GB, so it will take some time to download. After you decompress the archive, the dataset 
is located in the ../data/VOCdevkit/VOC2012 path. 


#@save
d2l.DATA_HUB['voc2012'] = (d2l.DATA_URL + 'VOCtrainval_11-May-2012.tar',
                           '4e443f8a2eca6b1dac8a6c57641b67dd40621a49')

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')


Go to ../data/VOCdevkit/VOC2012 to see the different parts of the dataset. The
ImageSets/Segmentation path contains text files that specify the training and testing examples. The
JPEGImages and SegmentationClass paths contain the example input images and labels, respectively.
These labels are also in image format, with the same dimensions as the input images to which
they correspond. In the labels, pixels with the same color belong to the same semantic category.
The read_voc_images function defined below reads all input images and labels into memory.


#@save
def read_voc_images(voc_dir, is_train=True):
    """Read all VOC feature and label images."""
    txt_fname = os.path.join(voc_dir, 'ImageSets', 'Segmentation',
                             'train.txt' if is_train else 'val.txt')
    with open(txt_fname, 'r') as f:
        images = f.read().split()
    features, labels = [], []
    for i, fname in enumerate(images):
        features.append(image.imread(os.path.join(
            voc_dir, 'JPEGImages', f'{fname}.jpg')))
        labels.append(image.imread(os.path.join(
            voc_dir, 'SegmentationClass', f'{fname}.png')))
    return features, labels


train_features, train_labels = read_voc_images(voc_dir, True)


We draw the first five input images and their labels. In the label images, white represents borders 
and black represents the background. Other colors correspond to different categories. 


n = 5
imgs = train_features[0:n] + train_labels[0:n]
d2l.show_images(imgs, 2, n);





187 http://host.robots.ox.ac.uk/pascal/VOC/voc2012/


Next, we list each RGB color value in the labels and the categories they label.


#@save
VOC_COLORMAP = [[0, 0, 0], [128, 0, 0], [0, 128, 0], [128, 128, 0],
                [0, 0, 128], [128, 0, 128], [0, 128, 128], [128, 128, 128],
                [64, 0, 0], [192, 0, 0], [64, 128, 0], [192, 128, 0],
                [64, 0, 128], [192, 0, 128], [64, 128, 128], [192, 128, 128],
                [0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0],
                [0, 64, 128]]

#@save
VOC_CLASSES = ['background', 'aeroplane', 'bicycle', 'bird', 'boat',
               'bottle', 'bus', 'car', 'cat', 'chair', 'cow',
               'diningtable', 'dog', 'horse', 'motorbike', 'person',
               'potted plant', 'sheep', 'sofa', 'train', 'tv/monitor']


After defining the two constants above, we can easily find the category index for each pixel in the 
labels. 


#@save
def build_colormap2label():
    """Build an RGB color to label mapping for segmentation."""
    colormap2label = np.zeros(256 ** 3)
    for i, colormap in enumerate(VOC_COLORMAP):
        colormap2label[(colormap[0]*256 + colormap[1])*256 + colormap[2]] = i
    return colormap2label


#@save
def voc_label_indices(colormap, colormap2label):
    """Map an RGB color to a label."""
    colormap = colormap.astype(np.int32)
    idx = ((colormap[:, :, 0] * 256 + colormap[:, :, 1]) * 256
           + colormap[:, :, 2])
    return colormap2label[idx]
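
The idea behind the two functions above is to encode each RGB triple as one integer, (R × 256 + G) × 256 + B, and use that integer as a lookup key. The following plain-Python sketch shows the same encoding on a tiny label image. It uses a dictionary instead of a dense array of length 256³, and the names rgb_to_key, lookup, and label_pixels are illustrative assumptions.

```python
def rgb_to_key(rgb):
    """Encode an (R, G, B) triple with 8-bit channels as a single integer."""
    r, g, b = rgb
    return (r * 256 + g) * 256 + b


# First three entries of the VOC colormap: background, aeroplane, bicycle.
colormap = [[0, 0, 0], [128, 0, 0], [0, 128, 0]]
lookup = {rgb_to_key(color): i for i, color in enumerate(colormap)}

# A 2 x 2 label image given as RGB triples.
label_pixels = [[(0, 0, 0), (128, 0, 0)],
                [(0, 128, 0), (128, 0, 0)]]
indices = [[lookup[rgb_to_key(p)] for p in row] for row in label_pixels]
print(indices)
```

A dense array trades memory for speed: it lets the whole label image be converted with a single vectorized indexing operation, whereas the dictionary version loops over pixels.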


For example, in the first example image, the category index for the front part of the airplane is 1 
and the index for the background is 0. 


y = voc_label_indices(train_labels[0], build_colormap2label())
y[105:115, 130:140], VOC_CLASSES[1]


[Output: a 10 × 10 array of category indices for the selected patch (0 for background, 1 for aeroplane), followed by the class name 'aeroplane'.]


Data Preprocessing 


In the preceding chapters, we scaled images to make them fit the input shape of the model. In 
semantic segmentation, this method would require us to re-map the predicted pixel categories 
back to the original-size input image. It would be very difficult to do this precisely, especially in 
segmented regions with different semantics. To avoid this problem, we crop the images to set 
dimensions and do not scale them. Specifically, we use the random cropping method used in 
image augmentation to crop the same region from input images and their labels. 
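
The key point is that the crop rectangle is sampled once and then applied to both the image and its label, so that pixels stay aligned. A minimal plain-Python sketch of this idea, using nested lists in place of images, might look as follows. The name paired_random_crop, the rng argument, and the toy data are assumptions for illustration and are not part of the MXNet image API.

```python
import random


def paired_random_crop(feature, label, height, width, rng=random):
    """Crop the same random (height x width) region from feature and label."""
    h, w = len(feature), len(feature[0])
    # Sample the top-left corner once; reuse it for both inputs.
    top = rng.randint(0, h - height)
    left = rng.randint(0, w - width)

    def crop(img):
        return [row[left:left + width] for row in img[top:top + height]]

    return crop(feature), crop(label)


# Toy 5 x 6 "image" whose pixels store their own coordinates, and a label
# whose value at (r, c) is r * 10 + c, so alignment is easy to check.
feature = [[(r, c) for c in range(6)] for r in range(5)]
label = [[r * 10 + c for c in range(6)] for r in range(5)]
f_crop, l_crop = paired_random_crop(feature, label, 3, 4,
                                    rng=random.Random(0))
aligned = all(l == r * 10 + c
              for f_row, l_row in zip(f_crop, l_crop)
              for (r, c), l in zip(f_row, l_row))
print(len(f_crop), len(f_crop[0]), aligned)
```

Cropping the two inputs with independently sampled rectangles would break the pixel-level correspondence between an image and its label.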


#@save
def voc_rand_crop(feature, label, height, width):
    """Randomly crop for both feature and label images."""
    feature, rect = image.random_crop(feature, (width, height))
    label = image.fixed_crop(label, *rect)
    return feature, label


imgs = []
for _ in range(n):
    imgs += voc_rand_crop(train_features[0], train_labels[0], 200, 300)
d2l.show_images(imgs[::2] + imgs[1::2], 2, n);


Dataset Classes for Custom Semantic Segmentation


We use the inherited Dataset class provided by Gluon to customize the semantic segmentation
dataset class VOCSegDataset. By implementing the __getitem__ function, we can arbitrarily access
the input image with the index idx and the category indexes for each of its pixels from the dataset.
As some images in the dataset may be smaller than the output dimensions specified for random
cropping, we must remove these examples by using a custom filter function. In addition, we
define the normalize_image function to normalize each of the three RGB channels of the input
images.


#@save
class VOCSegDataset(gluon.data.Dataset):
    """A customized dataset to load VOC dataset."""

    def __init__(self, is_train, crop_size, voc_dir):
        self.rgb_mean = np.array([0.485, 0.456, 0.406])
        self.rgb_std = np.array([0.229, 0.224, 0.225])
        self.crop_size = crop_size
        features, labels = read_voc_images(voc_dir, is_train=is_train)
        self.features = [self.normalize_image(feature)
                         for feature in self.filter(features)]
        self.labels = self.filter(labels)
        self.colormap2label = build_colormap2label()
        print('read ' + str(len(self.features)) + ' examples')

    def normalize_image(self, img):
        return (img.astype('float32') / 255 - self.rgb_mean) / self.rgb_std

    def filter(self, imgs):
        return [img for img in imgs if (
            img.shape[0] >= self.crop_size[0] and
            img.shape[1] >= self.crop_size[1])]

    def __getitem__(self, idx):
        feature, label = voc_rand_crop(self.features[idx], self.labels[idx],
                                       *self.crop_size)
        return (feature.transpose(2, 0, 1),
                voc_label_indices(label, self.colormap2label))

    def __len__(self):
        return len(self.features)


Reading the Dataset 


Using the custom VOCSegDataset class, we create the training set and testing set instances. We
assume that the random cropping operation outputs images of shape 320 × 480. Below, we can see
the number of examples retained in the training and testing sets.


crop_size = (320, 480)
voc_train = VOCSegDataset(True, crop_size, voc_dir)
voc_test = VOCSegDataset(False, crop_size, voc_dir)


read 1114 examples
read 1078 examples


We set the batch size to 64 and define the iterators for the training and testing sets. We then print
the shape of the first minibatch. In contrast to image classification and object recognition, labels
here are three-dimensional arrays.


batch_size = 64
train_iter = gluon.data.DataLoader(voc_train, batch_size, shuffle=True,
                                   last_batch='discard',
                                   num_workers=d2l.get_dataloader_workers())
for X, Y in train_iter:
    print(X.shape)
    print(Y.shape)
    break


(64, 3, 320, 480) 
(64, 320, 480) 


Putting All Things Together 


Finally, we define a function load_data_voc that downloads and loads this dataset, and then re- 
turns the data iterators. 


#@save
def load_data_voc(batch_size, crop_size):
    """Download and load the VOC2012 semantic dataset."""
    voc_dir = d2l.download_extract('voc2012', os.path.join(
        'VOCdevkit', 'VOC2012'))
    num_workers = d2l.get_dataloader_workers()
    train_iter = gluon.data.DataLoader(
        VOCSegDataset(True, crop_size, voc_dir), batch_size,
        shuffle=True, last_batch='discard', num_workers=num_workers)
    test_iter = gluon.data.DataLoader(
        VOCSegDataset(False, crop_size, voc_dir), batch_size,
        last_batch='discard', num_workers=num_workers)
    return train_iter, test_iter


Summary


• Semantic segmentation looks at how images can be segmented into regions with different
semantic categories.


• In the semantic segmentation field, one important dataset is Pascal VOC2012.


• Because the input images and labels in semantic segmentation have a one-to-one
correspondence at the pixel level, we randomly crop them to a fixed size, rather than scaling them.


Exercises


1. Recall the content we covered in Section 13.1. Which of the image augmentation methods 
used in image classification would be hard to use in semantic segmentation? 


Discussions¹⁸⁸


13.10 Transposed Convolution 


The layers we have introduced so far for convolutional neural networks, including convolutional
layers (Section 6.2) and pooling layers (Section 6.5), often reduce the input width and height, or
keep them unchanged. Applications such as semantic segmentation (Section 13.9) and generative
adversarial networks (Section 17.2), however, require predicting values for each pixel and
therefore need to increase the input width and height. Transposed convolution, also named
fractionally-strided convolution (Dumoulin & Visin, 2016) or deconvolution (Long et al., 2015),
serves this purpose.


from mxnet import np, npx, init
from mxnet.gluon import nn
from d2l import mxnet as d2l

npx.set_np()


13.10.1 Basic 2D Transposed Convolution


Let us consider a basic case in which both the input and the output have 1 channel, with 0 padding
and a stride of 1. Fig. 13.10.1 illustrates how transposed convolution with a 2 × 2 kernel is computed
on a 2 × 2 input matrix.


Fig. 13.10.1: Transposed convolution layer with a 2 x 2 kernel. 


We can implement this operation by giving matrix kernel K and matrix input X. 


def trans_conv(X, K):
    h, w = K.shape
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i: i + h, j: j + w] += X[i, j] * K
    return Y





188 https://discuss.d2l.ai/t/375


Recall that convolution computes its results by Y[i, j] = (X[i: i + h, j: j + w] * K).sum()
(refer to corr2d in Section 6.2), which summarizes input values through the kernel. The
transposed convolution, in contrast, broadcasts input values through the kernel, which results in a
larger output shape.


Verify the results in Fig. 13.10.1.


X = np.array([[0., 1], [2, 3]])
K = np.array([[0., 1], [2, 3]])
trans_conv(X, K)


array([[ 0.,  0.,  1.],
       [ 0.,  4.,  6.],
       [ 4., 12.,  9.]])


Alternatively, we can use nn.Conv2DTranspose to obtain the same results. As with nn.Conv2D, both
the input and the kernel should be 4-D tensors.


X, K = X.reshape(1, 1, 2, 2), K.reshape(1, 1, 2, 2)
tconv = nn.Conv2DTranspose(1, kernel_size=2)
tconv.initialize(init.Constant(K))
tconv(X)


array([[[[ 0.,  0.,  1.],
         [ 0.,  4.,  6.],
         [ 4., 12.,  9.]]]])


13.10.2 Padding, Strides, and Channels 


We apply padding elements to the input in convolution, while they are applied to the output in
transposed convolution. A 1 × 1 padding means we first compute the output as normal, then
remove the first and last rows and columns.


tconv = nn.Conv2DTranspose(1, kernel_size=2, padding=1)
tconv.initialize(init.Constant(K))
tconv(X)


array([[[[4.]]]])


Similarly, strides are applied to outputs as well.


tconv = nn.Conv2DTranspose(1, kernel_size=2, strides=2)
tconv.initialize(init.Constant(K))
tconv(X)


array([[[[0., 0., 0., 1.],
         [0., 0., 2., 3.],
         [0., 2., 0., 3.],
         [4., 6., 6., 9.]]]])


The multi-channel extension of the transposed convolution is the same as that of the convolution. When
the input has multiple channels, denoted by $c_i$, the transposed convolution assigns a $k_h \times k_w$ kernel
matrix to each input channel. If the output has a channel size $c_o$, then we have a $c_i \times k_h \times k_w$ kernel
for each output channel.


As a result, if we feed X into a convolutional layer f to compute Y = f(X) and create a transposed
convolution layer g with the same hyperparameters as f except for the output channel set to be
the channel size of X, then g(Y) should have the same shape as X. Let us verify this statement.


X = np.random.uniform(size=(1, 10, 16, 16))
conv = nn.Conv2D(20, kernel_size=5, padding=2, strides=3)
tconv = nn.Conv2DTranspose(10, kernel_size=5, padding=2, strides=3)
conv.initialize()
tconv.initialize()
tconv(conv(X)).shape == X.shape


True
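
The effect of padding and strides on a single-channel transposed convolution can also be sketched in plain Python, together with the resulting output sizes. This is an illustrative sketch under stated assumptions, not the nn.Conv2DTranspose implementation: the names trans_conv2d, conv_out_size, and trans_conv_out_size are hypothetical, the same stride and padding are used for both dimensions, and the size formulas are the standard ones for these hyperparameters.

```python
def trans_conv2d(X, K, stride=1, padding=0):
    """Single-channel transposed convolution on nested lists."""
    n_h, n_w = len(X), len(X[0])
    k_h, k_w = len(K), len(K[0])
    out_h = (n_h - 1) * stride + k_h
    out_w = (n_w - 1) * stride + k_w
    Y = [[0.0] * out_w for _ in range(out_h)]
    for i in range(n_h):
        for j in range(n_w):
            # Broadcast X[i][j] through the kernel; the stride moves the
            # position of the kernel window in the output.
            for a in range(k_h):
                for b in range(k_w):
                    Y[i * stride + a][j * stride + b] += X[i][j] * K[a][b]
    # Padding removes rows and columns from the border of the output.
    return [row[padding:out_w - padding]
            for row in Y[padding:out_h - padding]]


def conv_out_size(n, k, p=0, s=1):
    """Output size of a convolution along one dimension."""
    return (n + 2 * p - k) // s + 1


def trans_conv_out_size(n, k, p=0, s=1):
    """Output size of a transposed convolution along one dimension."""
    return (n - 1) * s + k - 2 * p


X = [[0, 1], [2, 3]]
K = [[0, 1], [2, 3]]
print(trans_conv2d(X, K))
print(trans_conv2d(X, K, padding=1))
print(trans_conv2d(X, K, stride=2))
print(trans_conv_out_size(conv_out_size(16, 5, p=2, s=3), 5, p=2, s=3))
```

Note that the round trip recovers the original size only when the stride evenly divides n + 2p − k, as it does for the 16 × 16 input above; otherwise the floor in the convolution discards information about the input size.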


13.10.3 Analogy to Matrix Transposition 


The transposed convolution takes its name from matrix transposition. In fact, convolution
operations can also be carried out by matrix multiplication. In the example below, we define a 3 × 3
input X with a 2 × 2 kernel K, and then use corr2d to compute the convolution output.


X = np.arange(9.0).reshape(3, 3)
K = np.array([[0, 1], [2, 3]])
Y = d2l.corr2d(X, K)
Y

array([[19., 25.],
       [37., 43.]])


Next, we rewrite the convolution kernel K as a matrix W. Its shape will be (4, 9), where the i-th row
represents applying the kernel to the input to generate the i-th output element.


def kernel2matrix(K):
    k, W = np.zeros(5), np.zeros((4, 9))
    k[:2], k[3:5] = K[0, :], K[1, :]
    W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:] = k, k, k, k
    return W

W = kernel2matrix(K)
W

array([[0., 1., 0., 2., 3., 0., 0., 0., 0.],
       [0., 0., 1., 0., 2., 3., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 2., 3., 0.],
       [0., 0., 0., 0., 0., 1., 0., 2., 3.]])


Then the convolution operator can be implemented by matrix multiplication with proper reshaping.


Y == np.dot(W, X.reshape(-1)).reshape(2, 2)


array([[ True, True], 
[ True, True]]) 


We can implement transposed convolution as a matrix multiplication as well by reusing
kernel2matrix. To reuse the generated W, we construct a 2 × 2 input, so the corresponding weight
matrix will have the shape (9, 4), which is $W^\top$. Let us verify the results.


X = np.array([[0, 1], [2, 3]])
Y = trans_conv(X, K) 
Y == np.dot(W.T, X.reshape(-1)).reshape(3, 3) 


array([[ True, True, True], 
[ True, True, True], 
[ True, True, True]]) 
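For readers without MXNet at hand, the two checks above can be reproduced in plain NumPy. In this sketch, corr2d and trans_conv are stand-ins written for illustration (assumed to behave like d2l.corr2d and the trans_conv function used in this section, with stride 1 and no padding); only kernel2matrix follows the listing above.

```python
import numpy as np

def corr2d(X, K):
    """2-D cross-correlation with stride 1 and no padding."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def trans_conv(X, K):
    """Basic transposed convolution with stride 1 and no padding."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i:i + h, j:j + w] += X[i, j] * K
    return Y

def kernel2matrix(K):
    """Unroll a 2 x 2 kernel into the (4, 9) matrix acting on a flattened 3 x 3 input."""
    k, W = np.zeros(5), np.zeros((4, 9))
    k[:2], k[3:5] = K[0, :], K[1, :]
    W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:] = k, k, k, k
    return W

K = np.array([[0.0, 1.0], [2.0, 3.0]])
W = kernel2matrix(K)

# Convolution: multiply by W, mapping 9 inputs to 4 outputs.
X3 = np.arange(9.0).reshape(3, 3)
conv_ok = np.allclose(corr2d(X3, K), (W @ X3.reshape(-1)).reshape(2, 2))

# Transposed convolution: multiply by W.T, mapping 4 inputs to 9 outputs.
X2 = np.array([[0.0, 1.0], [2.0, 3.0]])
tconv_ok = np.allclose(trans_conv(X2, K), (W.T @ X2.reshape(-1)).reshape(3, 3))
```

Note that multiplying by the transposed matrix restores the shape of the convolution input, not its values; the transposed convolution is not an inverse of the convolution.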


Summary 


• Compared to convolutions that reduce inputs through kernels, transposed convolutions
broadcast inputs.


• If a convolution layer reduces the input width and height by factors of $n_w$ and $n_h$,
respectively, then a transposed convolution layer with the same kernel sizes, padding, and strides
will increase the input width and height by factors of $n_w$ and $n_h$, respectively.


• We can implement convolution operations by matrix multiplication; the corresponding
transposed convolutions can be done by multiplication with the transposed matrix.


Exercises 


1. Is it efficient to use matrix multiplication to implement convolution operations? Why? 


Discussions: https://discuss.d2l.ai/t/376


13.11 Fully Convolutional Networks (FCN)


We previously discussed semantic segmentation, which predicts a category for each pixel in an
image. A fully convolutional network (FCN) (Long et al., 2015) uses a convolutional neural network
to transform image pixels to pixel categories. Unlike the convolutional neural networks previously
introduced, an FCN transforms the height and width of the intermediate layer feature map back
to the size of the input image through the transposed convolution layer, so that the predictions
have a one-to-one correspondence with the input image in the spatial dimensions (height and
width). Given a position in the spatial dimensions, the output along the channel dimension is the
category prediction for the pixel at that position.
We will first import the packages and modules needed for the experiment and then explain the
transposed convolution layer.


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon, image, init, np, npx
from mxnet.gluon import nn

npx.set_np()


13.11.1 Constructing a Model 


Here, we demonstrate the most basic design of a fully convolutional network model. As shown in
Fig. 13.11.1, the fully convolutional network first uses a convolutional neural network to extract
image features, then transforms the number of channels into the number of categories through
a 1 × 1 convolution layer, and finally transforms the height and width of the feature map to
the size of the input image by using the transposed convolution layer introduced in Section 13.10.
The model output has the same height and width as the input image and has a one-to-one
correspondence in spatial positions. The final output channel contains the category prediction of
the pixel at the corresponding spatial position.


Fig. 13.11.1: Fully convolutional network. (Diagram labels: 1x1 Conv, Transposed Conv; predicted
categories shown: Background, Cat.)


Below, we use a ResNet-18 model pre-trained on the ImageNet dataset to extract image features
and record the network instance as pretrained_net. As you can see, the last two layers of the
model member variable features are the global average pooling layer GlobalAvgPool2D and the
example flattening layer Flatten. The output module contains the fully connected layer used for
output. These layers are not required for a fully convolutional network.


pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True)
pretrained_net.features[-4:], pretrained_net.output


(HybridSequential(
   (0): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
   (1): Activation(relu)
   (2): GlobalAvgPool2D(size=(1, 1), stride=(1, 1), padding=(0, 0), ceil_mode=True, global_pool=True, pool_type=avg, layout=NCHW)
   (3): Flatten
 ),
 Dense(512 -> 1000, linear))


Next, we create the fully convolutional network instance net. It copies all the layers of the
features member variable of pretrained_net except the last two, together with the model
parameters obtained from pre-training.


net = nn.HybridSequential()
for layer in pretrained_net.features[:-2]:
    net.add(layer)


Given an input with a height and width of 320 and 480, respectively, the forward computation of net
reduces the height and width of the input to 1/32 of the original, i.e., 10 and 15.


X = np.random.uniform(size=(1, 3, 320, 480))
net(X).shape


(1, 512, 10, 15) 


Next, we transform the number of output channels to the number of categories of Pascal VOC2012
(21) through the 1 × 1 convolution layer. Finally, we need to magnify the height and width of the
feature map by a factor of 32 to change them back to the height and width of the input image.
Recall the calculation method for the convolution layer output shape described in Section 6.3.
Because $(320 - 64 + 16 \times 2 + 32)/32 = 10$ and $(480 - 64 + 16 \times 2 + 32)/32 = 15$, we
construct a transposed convolution layer with a stride of 32 and set the height and width of the
convolution kernel to 64 and the padding to 16. It is not difficult to see that, if the stride is $s$,
the padding is $s/2$ (assuming $s/2$ is an integer), and the height and width of the convolution
kernel are $2s$, the transposed convolution kernel will magnify both the height and width of the
input by a factor of $s$.


num_classes = 21
net.add(nn.Conv2D(num_classes, kernel_size=1),
        nn.Conv2DTranspose(
            num_classes, kernel_size=64, padding=16, strides=32))
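The magnification rule stated above can be checked with the standard output-size formula for a transposed convolution. The derivation below is a short sketch that assumes symmetric padding and no output padding; the symbols $n$, $k$, and $p$ (input size, kernel size, and padding along one axis) are introduced here only for illustration.

```latex
\begin{aligned}
n_{\text{out}} &= (n - 1)\,s - 2p + k \\
               &= (n - 1)\,s - 2 \cdot \tfrac{s}{2} + 2s \qquad (p = s/2,\ k = 2s) \\
               &= ns - s - s + 2s \\
               &= ns .
\end{aligned}
```

With $s = 32$, $k = 64$, and $p = 16$, a 10 × 15 feature map is therefore mapped back to 320 × 480.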
13.11.2 Initializing the Transposed Convolution Layer


We already know that the transposed convolution layer can magnify a feature map. In image
processing, sometimes we need to magnify an image, i.e., perform upsampling. There are many
methods for upsampling, and one common method is bilinear interpolation. Simply speaking, in order
to get the pixel of the output image at the coordinates $(x, y)$, the coordinates are first mapped
to the coordinates $(x', y')$ of the input image. This can be done based on the ratio of the size of
the input to the size of the output. The mapped values $x'$ and $y'$ are usually real numbers. Then,
we find the four pixels closest to the coordinate $(x', y')$ on the input image. Finally, the pixel
of the output image at coordinates $(x, y)$ is calculated based on these four pixels on the input
image and their relative distances to $(x', y')$. Upsampling by bilinear interpolation can be
implemented by a transposed convolution layer whose convolution kernel is constructed using the
following bilinear_kernel function. Due to space limitations, we only give the implementation of
the bilinear_kernel function and will not discuss the principles of the algorithm.


def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (np.arange(kernel_size).reshape(-1, 1),
          np.arange(kernel_size).reshape(1, -1))
    filt = (1 - np.abs(og[0] - center) / factor) * \
           (1 - np.abs(og[1] - center) / factor)
    weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return np.array(weight)
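To build some intuition for why this kernel interpolates, it may help to look at one dimension only, since the 2-D filter in bilinear_kernel is the outer product of a 1-D profile with itself. The following NumPy sketch uses two hypothetical helpers, bilinear_filter_1d and trans_conv1d, written for this illustration; it is not how Gluon implements the layer.

```python
import numpy as np

def bilinear_filter_1d(kernel_size):
    """1-D profile of the filter built by bilinear_kernel."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    return 1 - np.abs(np.arange(kernel_size) - center) / factor

def trans_conv1d(x, f, stride, padding):
    """Naive 1-D transposed convolution (illustrative helper)."""
    y = np.zeros((len(x) - 1) * stride + len(f))
    for i, v in enumerate(x):
        y[i * stride:i * stride + len(f)] += v * f
    # Padding removes `padding` elements from each end of the output.
    return y[padding:len(y) - padding]

f = bilinear_filter_1d(4)            # triangular weights 0.25, 0.75, 0.75, 0.25
x = np.array([0.0, 1.0, 2.0, 3.0])   # a linear ramp
y = trans_conv1d(x, f, stride=2, padding=1)
# y has 8 elements; away from the borders it is a ramp with half the step size.
```

The interior of the output rises in steps of 0.5, i.e., it linearly interpolates the input ramp. The first and last samples deviate because nothing beyond the border contributes there.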


Now, we will experiment with bilinear interpolation upsampling implemented by transposed
convolution layers. We construct a transposed convolution layer that magnifies the height and width
of the input by a factor of 2 and initialize its convolution kernel with the bilinear_kernel function.


conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2) 
conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4))) 


Read the image X and record the result of upsampling as Y. In order to print the image, we need 
to adjust the position of the channel dimension. 


img = image.imread('../img/catdog.jpg') 

X = np.expand_dims(img.astype('float32').transpose(2, 0, 1), axis=0) / 255 
Y = conv_trans(X) 

out_img = Y[0].transpose(1, 2, 0) 


As you can see, the transposed convolution layer magnifies both the height and width of the image
by a factor of 2. It is worth mentioning that, apart from the difference in coordinate scale, the
image magnified by bilinear interpolation and the original image printed in Section 13.3 look the
same.


d2l.set_figsize()
print('input image shape:', img.shape)
d2l.plt.imshow(img.asnumpy());
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img.asnumpy());

input image shape: (561, 728, 3)
output image shape: (1122, 1456, 3)




In a fully convolutional network, we initialize the transposed convolution layer to perform
bilinear interpolation upsampling. For the 1 × 1 convolution layer, we use Xavier random
initialization.


W = bilinear_kernel(num_classes, num_classes, 64) 
net[-1].initialize(init.Constant(W)) 
net[-2].initialize(init=init.Xavier()) 


13.11.3 Reading the Dataset 


We read the dataset using the method described in the previous section. Here, we specify the shape
of the randomly cropped output image as 320 × 480, so both the height and width are divisible by
32.


batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)


Downloading ../data/VOCtrainval_11-May-2012.tar from http://d2l-data.s3-accelerate.amazonaws.com/VOCtrainval_11-May-2012.tar...
read 1114 examples
read 1078 examples


13.11.4 Training 


Now we can start training the model. The loss function and accuracy calculation here are not 
substantially different from those used in image classification. Because we use the channel of the 
transposed convolution layer to predict pixel categories, the axis=1 (channel dimension) option 
is specified in SoftmaxCrossEntropyLoss. In addition, the model calculates the accuracy based on 
whether the prediction category of each pixel is correct. 
num_epochs, lr, wd, devices = 5, 0.1, 1e-3, d2l.try_all_gpus()
loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1)
net.collect_params().reset_ctx(devices)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': lr, 'wd': wd})
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)


loss 0.329, train acc 0.892, test acc 0.852
193.7 examples/sec on [gpu(0), gpu(1)]


[Plot: train loss, train acc, and test acc curves.]
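The per-pixel loss and accuracy described in this section can be sketched in plain NumPy. This is only an illustration of what taking the softmax along axis=1 means for a (batch, classes, height, width) output; the function names pixel_softmax_ce and pixel_accuracy are assumptions made here, and the sketch averages over all pixels rather than reproducing Gluon's exact reduction.

```python
import numpy as np

def pixel_softmax_ce(logits, labels):
    """Mean cross-entropy; softmax is taken over the channel axis (axis=1).

    logits: (batch, classes, height, width); labels: (batch, height, width) ints.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[:, None, :, :], axis=1)
    return -picked.mean()

def pixel_accuracy(logits, labels):
    """Fraction of pixels whose highest-scoring class matches the label."""
    return (logits.argmax(axis=1) == labels).mean()

logits = np.zeros((1, 3, 2, 2))
logits[0, 1] = 5.0                       # class 1 scores highest at every pixel
labels = np.array([[[1, 1], [1, 0]]])    # one of the four pixels is class 0
acc = pixel_accuracy(logits, labels)     # 3 of 4 pixels are correct
loss_value = pixel_softmax_ce(logits, labels)
```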





13.11.5 Prediction 


During prediction, we need to standardize the input image in each channel and transform it
into the four-dimensional input format required by the convolutional neural network.


def predict(img):
    X = test_iter._dataset.normalize_image(img)
    X = np.expand_dims(X.transpose(2, 0, 1), axis=0)
    pred = net(X.as_in_ctx(devices[0])).argmax(axis=1)
    return pred.reshape(pred.shape[1], pred.shape[2])


To visualize the predicted categories for each pixel, we map the predicted categories back to their 
labeled colors in the dataset. 


def label2image(pred):
    colormap = np.array(d2l.VOC_COLORMAP, ctx=devices[0], dtype='uint8')
    X = pred.astype('int32')
    return colormap[X, :]
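The indexing in label2image relies on integer array indexing: indexing a palette of shape (number of classes, 3) with an integer array of shape (height, width) yields an array of shape (height, width, 3). The small NumPy sketch below uses a hypothetical three-color palette purely for illustration; the actual VOC_COLORMAP has 21 entries.

```python
import numpy as np

# Hypothetical three-class palette used only for this illustration.
colormap = np.array([[0, 0, 0], [128, 0, 0], [0, 128, 0]], dtype=np.uint8)

pred = np.array([[0, 1],
                 [2, 1]])      # predicted class index for each pixel
rgb = colormap[pred]           # one RGB triple per pixel
```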


The size and shape of the images in the test dataset vary. Because the model uses a transposed
convolution layer with a stride of 32, when the height or width of the input image is not divisible
by 32, the height or width of the transposed convolution layer output deviates from the size of the
input image. To solve this problem, we can crop multiple rectangular areas in the image with
heights and widths that are integer multiples of 32, and then perform forward computation on the
pixels in these areas. When combined, these areas must completely cover the input image. When
a pixel is covered by multiple areas, the average of the transposed convolution layer outputs in the
forward computation of the different areas can be used as the input to the softmax operation to
predict the category.
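The covering-and-averaging strategy described above can be sketched as follows. This is a minimal illustration, not the book's implementation: tile_starts and predict_tiled are hypothetical helpers, score_fn stands for any function that returns per-class scores of shape (classes, tile height, tile width) for a crop, and the image is assumed to be at least as large as one tile.

```python
import numpy as np

def tile_starts(length, tile):
    """Start offsets of tiles of size `tile` that together cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, tile))
    starts.append(length - tile)   # shift the last tile back so it stays inside
    return starts

def predict_tiled(img, score_fn, tile_hw=(320, 480), num_classes=21):
    """Average class scores over overlapping tiles, then take the argmax."""
    h, w = img.shape[:2]
    th, tw = tile_hw
    scores = np.zeros((num_classes, h, w))
    counts = np.zeros((h, w))
    for top in tile_starts(h, th):
        for left in tile_starts(w, tw):
            crop = img[top:top + th, left:left + tw]
            scores[:, top:top + th, left:left + tw] += score_fn(crop)
            counts[top:top + th, left:left + tw] += 1
    # Softmax is monotonic, so the argmax of the averaged scores is the prediction.
    return (scores / counts).argmax(axis=0)

def toy_scores(crop):
    """Stand-in for a network: scores class 1 where the pixel value is high."""
    return np.stack([1.0 - crop, crop])

img = (np.arange(50 * 70).reshape(50, 70) % 3 == 0).astype(float)
pred = predict_tiled(img, toy_scores, tile_hw=(32, 32), num_classes=2)
```

Shifting the last tile back, instead of padding the image, keeps every crop the same size, at the cost of some overlap near the right and bottom borders.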


For the sake of simplicity, we only read a few large test images and crop an area with a shape of
320 × 480 from the top-left corner of each image. Only this area is used for prediction. For each
input image, we print the cropped area first, then the predicted result, and finally the labeled
category.


voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 480, 320)
    X = image.fixed_crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [X, pred, image.fixed_crop(test_labels[i], *crop_rect)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);





Summary 


• The fully convolutional network first uses a convolutional neural network to extract image
features, then transforms the number of channels into the number of categories through a
1 × 1 convolution layer, and finally transforms the height and width of the feature map to
the size of the input image by using the transposed convolution layer to output the category
of each pixel.


• In a fully convolutional network, we initialize the transposed convolution layer to perform
bilinear interpolation upsampling.

Exercises


1. If we use Xavier to randomly initialize the transposed convolution layer, what will happen 
to the result? 


2. Can you further improve the accuracy of the model by tuning the hyperparameters? 
3. Predict the categories of all pixels in the test image. 


4. The outputs of some intermediate layers of the convolutional neural network are also used 
in the paper on fully convolutional networks (Long et al., 2015). Try to implement this idea. 


Discussions: https://discuss.d2l.ai/t/377


13.12 Neural Style Transfer


If you use social sharing apps or happen to be an amateur photographer, you are familiar with
filters. Filters can alter the color styles of photos to make the background sharper or people's
faces whiter. However, a filter generally can only change one aspect of a photo. To create the
ideal photo, you often need to try many different filter combinations. This process is as complex
as tuning the hyperparameters of a model.


In this section, we will discuss how we can use convolutional neural networks (CNNs) to
automatically apply the style of one image to another image, an operation known as style transfer
(Gatys et al., 2016). Here, we need two input images, one content image and one style image. We use
a neural network to alter the content image so that its style mirrors that of the style image. In
Fig. 13.12.1, the content image is a landscape photo the author took in Mount Rainier National
Park near Seattle. The style image is an oil painting of oak trees in autumn. The output composite
image retains the overall shapes of the objects in the content image, but applies the oil painting
brushwork of the style image and makes the overall color more vivid.


Fig. 13.12.1: Content and style input images and composite image produced by style transfer.
(Panels: content image, style image, composite image.)
13.12.1 Technique


The CNN-based style transfer model is shown in Fig. 13.12.2. First, we initialize the composite
image, for example, as the content image. This composite image is the only variable that needs to
be updated in the style transfer process, i.e., it is the model parameter to be updated. Then, we
select a pre-trained CNN to extract image features. Its model parameters do not need to be updated
during training. The deep CNN uses multiple neural layers that successively extract image
features. We can select the outputs of certain layers to use as content features or style features.
In the structure shown in Fig. 13.12.2, the pre-trained neural network contains three convolutional
layers. The second layer outputs the image content features, while the outputs of the first and
third layers are used as style features. Next, we use forward propagation (in the direction of the
solid lines) to compute the style transfer loss function and backward propagation (in the direction
of the dotted lines) to update the model parameter, constantly updating the composite image. The
loss functions used in style transfer generally have three parts:

1. Content loss is used to make the composite image approximate the content image in terms of
content features.
2. Style loss is used to make the composite image approximate the style image in terms of style
features.
3. Total variation loss helps reduce the noise in the composite image.

Finally, after we finish training the model, we output the style transfer model parameters to
obtain the final composite image.


Fig. 13.12.2: CNN-based style transfer process. Solid lines show the direction of forward
propagation and dotted lines show backward propagation. (Diagram labels: content image, composite
image, conv layers, content loss, style loss, total variation loss.)

Next, we will perform an experiment to help us better understand the technical details of style 
transfer. 
13.12.2 Reading the Content and Style Images


First, we read the content and style images. By printing out the image coordinate axes, we can see 
that they have different dimensions. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, gluon, image, init, np, npx
from mxnet.gluon import nn

npx.set_np()

d2l.set_figsize()
content_img = image.imread('../img/rainier.jpg')
d2l.plt.imshow(content_img.asnumpy());


style_img = image.imread('../img/autumn-oak.jpg')
d2l.plt.imshow(style_img.asnumpy());


13.12.3 Preprocessing and Postprocessing


Below, we define the functions for image preprocessing and postprocessing. The preprocess
function normalizes each of the three RGB channels of the input images and transforms the results
to a format that can be input to the CNN. The postprocess function restores the pixel values in the
output image to their original values before normalization. Because the image printing function
requires that each pixel has a floating point value from 0 to 1, we use the clip function to replace
values smaller than 0 or greater than 1 with 0 or 1, respectively.


rgb_mean = np.array([0.485, 0.456, 0.406]) 
rgb_std = np.array([0.229, 0.224, 0.225]) 


def preprocess(img, image_shape):
    img = image.imresize(img, *image_shape)
    img = (img.astype('float32') / 255 - rgb_mean) / rgb_std
    return np.expand_dims(img.transpose(2, 0, 1), axis=0)


def postprocess(img):
    img = img[0].as_in_ctx(rgb_std.ctx)
    return (img.transpose(1, 2, 0) * rgb_std + rgb_mean).clip(0, 1)
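That postprocessing inverts preprocessing can be checked with a small round trip. The sketch below uses plain NumPy and leaves out the resizing step; normalize and denormalize are names chosen for this illustration and mirror only the arithmetic and axis reordering of the two functions above.

```python
import numpy as np

rgb_mean = np.array([0.485, 0.456, 0.406])
rgb_std = np.array([0.229, 0.224, 0.225])

def normalize(img_uint8):
    """(H, W, 3) uint8 image -> (1, 3, H, W) standardized float array."""
    img = (img_uint8.astype('float32') / 255 - rgb_mean) / rgb_std
    return np.expand_dims(img.transpose(2, 0, 1), axis=0)

def denormalize(batch):
    """(1, 3, H, W) standardized array -> (H, W, 3) floats clipped to [0, 1]."""
    return (batch[0].transpose(1, 2, 0) * rgb_std + rgb_mean).clip(0, 1)

img = np.random.default_rng(0).integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
restored = denormalize(normalize(img))   # approximately img / 255
```

The clip only matters once the composite image has been updated during training, when denormalized values can leave the range from 0 to 1.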


13.12.4 Extracting Features 


We use the VGG-19 model pre-trained on the ImageNet dataset to extract image features[1]. 


pretrained_net = gluon.model_zoo.vision.vgg19(pretrained=True)


To extract image content and style features, we can select the outputs of certain layers in the VGG 
network. In general, the closer an output is to the input layer, the easier it is to extract image 
detail information. The farther away an output is, the easier it is to extract global information. To 
prevent the composite image from retaining too many details from the content image, we select 
a VGG network layer near the output layer to output the image content features. This layer is 
called the content layer. We also select the outputs of different layers from the VGG network for 
matching local and global styles. These are called the style layers. As we mentioned in Section 7.2, 
VGG networks have five convolutional blocks. In this experiment, we select the last convolutional 
layer of the fourth convolutional block as the content layer and the first layer of each block as style 
layers. We can obtain the indexes for these layers by printing the pretrained_net instance. 


style_layers, content_layers = [0, 5, 10, 19, 28], [25] 


During feature extraction, we only need to use all the VGG layers from the input layer to the content 
or style layer nearest the output layer. Below, we build a new network, net, which only retains the 
layers in the VGG network we need to use. We then use net to extract features. 


net = nn.Sequential()
for i in range(max(content_layers + style_layers) + 1):
    net.add(pretrained_net.features[i])


Given input X, if we simply call the forward computation net(X), we can only obtain the output of
the last layer. Because we also need the outputs of the intermediate layers, we need to perform
layer-by-layer computation and retain the content and style layer outputs.


def extract_features(X, content_layers, style_layers):
    contents = []
    styles = []
    for i in range(len(net)):
        X = net[i](X)
        if i in style_layers:
            styles.append(X)
        if i in content_layers:
            contents.append(X)
    return contents, styles
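The bookkeeping in extract_features does not depend on the layers being neural network layers. As a minimal sketch, the same loop can be run over ordinary Python callables; the toy layers and the name collect_outputs below are assumptions made purely for illustration.

```python
import numpy as np

# Stand-in "layers": any callables work for illustrating the bookkeeping.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]

def collect_outputs(x, layers, content_idx, style_idx):
    """Run the layers in order and keep the outputs at the requested indices."""
    contents, styles = [], []
    for i, layer in enumerate(layers):
        x = layer(x)              # output of layer i is the input of layer i + 1
        if i in style_idx:
            styles.append(x)
        if i in content_idx:
            contents.append(x)
    return contents, styles

contents, styles = collect_outputs(np.array([1.0]), layers,
                                   content_idx=[2], style_idx=[0, 3])
```

Starting from 1, the intermediate values are 2, 4, 1, and 1, so the content list holds the output of layer 2 and the style list holds the outputs of layers 0 and 3.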


Next, we define two functions: The get_contents function obtains the content features extracted 
from the content image, while the get_styles function obtains the style features extracted from 
the style image. Because we do not need to change the parameters of the pre-trained VGG model 
during training, we can extract the content features from the content image and style features 
from the style image before the start of training. As the composite image is the model parameter 
that must be updated during style transfer, we can only call the extract_features function during 
training to extract the content and style features of the composite image. 


def get_contents(image_shape, device):
    content_X = preprocess(content_img, image_shape).copyto(device)
    contents_Y, _ = extract_features(content_X, content_layers, style_layers)
    return content_X, contents_Y


def get_styles(image_shape, device):
    style_X = preprocess(style_img, image_shape).copyto(device)
    _, styles_Y = extract_features(style_X, content_layers, style_layers)
    return style_X, styles_Y


13.12.5 Defining the Loss Function 


Next, we will look at the loss function used for style transfer. The loss function includes the content 
loss, style loss, and total variation loss. 


Content Loss 


Similar to the loss function used in linear regression, content loss uses a square error function
to measure the difference in content features between the composite image and content image.
The two inputs of the square error function are both content layer outputs obtained from the
extract_features function.


def content_loss(Y_hat, Y):
    return np.square(Y_hat - Y).mean()


Style Loss


Style loss, similar to content loss, uses a square error function to measure the difference in style
between the composite image and style image. To express the styles output by the style layers,
we first use the extract_features function to compute the style layer output. Assuming that the
output has 1 example, $c$ channels, and a height and width of $h$ and $w$, we can transform the
output into the matrix $\mathbf{X}$, which has $c$ rows and $h \cdot w$ columns. You can think of
matrix $\mathbf{X}$ as the combination of the $c$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_c$, each
of length $hw$. Here, the vector $\mathbf{x}_i$ represents the style feature of channel $i$. In the
Gram matrix of these vectors, $\mathbf{X}\mathbf{X}^\top \in \mathbb{R}^{c \times c}$, element
$x_{ij}$ in row $i$ and column $j$ is the inner product of vectors $\mathbf{x}_i$ and
$\mathbf{x}_j$. It represents the correlation of the style features of channels $i$ and $j$. We use
this type of Gram matrix to represent the style output by the style layers. Note that when
$h \cdot w$ is large, the values in the Gram matrix tend to be large as well. In addition, the height
and width of the Gram matrix are both the number of channels $c$. To ensure that the style loss is
not affected by the size of these values, we define the gram function below to divide the Gram
matrix by the number of elements of $\mathbf{X}$, i.e., $c \cdot h \cdot w$.


def gram(X):
    num_channels, n = X.shape[1], X.size // X.shape[1]
    X = X.reshape((num_channels, n))
    return np.dot(X, X.T) / (num_channels * n)
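A tiny numerical example may make the normalized Gram matrix more tangible. The sketch below repeats the computation in plain NumPy on a feature map with 2 channels and a 2 × 2 spatial size; gram_np is a name chosen here for illustration.

```python
import numpy as np

def gram_np(X):
    """Normalized Gram matrix of a (1, c, h, w) feature map."""
    c = X.shape[1]
    n = X.size // c
    F = X.reshape(c, n)           # one row of length h * w per channel
    return F @ F.T / (c * n)      # divide by the number of elements c * h * w

X = np.arange(8.0).reshape(1, 2, 2, 2)   # c = 2 channels, h = w = 2
G = gram_np(X)                           # shape (c, c), symmetric
```

Here the two channel vectors are (0, 1, 2, 3) and (4, 5, 6, 7). Their inner products are 14, 38, and 126, and dividing by 8 gives the entries 1.75, 4.75, and 15.75.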


Naturally, the two Gram matrix inputs of the square error function for style loss are taken from 
the composite image and style image style layer outputs. Here, we assume that the Gram matrix 
of the style image, gram_Y, has been computed in advance. 


def style_loss(Y_hat, gram_Y):
    return np.square(gram(Y_hat) - gram_Y).mean()


Total Variation Loss


Sometimes, the composite images we learn have a lot of high-frequency noise, particularly bright
or dark pixels. One common noise reduction method is total variation denoising. We assume that
$x_{i,j}$ represents the pixel value at the coordinate $(i, j)$, so the total variation loss is:

$$\sum_{i,j} \left|x_{i,j} - x_{i+1,j}\right| + \left|x_{i,j} - x_{i,j+1}\right|. \qquad (13.12.1)$$

We try to make the values of neighboring pixels as similar as possible.


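def tv_loss(Y_hat):
    # Body reconstructed from Eq. (13.12.1); it averages rather than sums the differences
    return 0.5 * (np.abs(Y_hat[:, :, 1:, :] - Y_hat[:, :, :-1, :]).mean() +
                  np.abs(Y_hat[:, :, :, 1:] - Y_hat[:, :, :, :-1]).mean())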
Loss Function


The loss function for style transfer is the weighted sum of the content loss, style loss, and total
variation loss. By adjusting these weight hyperparameters, we can balance the retained content,
transferred style, and noise reduction in the composite image according to their relative
importance.


content_weight, style_weight, tv_weight = 1, 1e3, 10 


def compute_loss(X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram):
    # Calculate the content, style, and total variation losses respectively
    contents_l = [content_loss(Y_hat, Y) * content_weight for Y_hat, Y in zip(
        contents_Y_hat, contents_Y)]
    styles_l = [style_loss(Y_hat, Y) * style_weight for Y_hat, Y in zip(
        styles_Y_hat, styles_Y_gram)]
    tv_l = tv_loss(X) * tv_weight
    # Add up all the losses
    l = sum(styles_l + contents_l + [tv_l])
    return contents_l, styles_l, tv_l, l


13.12.6 Creating and Initializing the Composite Image 


In style transfer, the composite image is the only variable that needs to be updated. Therefore, we 
can define a simple model, GeneratedImage, and treat the composite image as a model parameter. 
In the model, forward computation only returns the model parameter. 


class GeneratedImage(nn.Block):
    def __init__(self, img_shape, **kwargs):
        super(GeneratedImage, self).__init__(**kwargs)
        self.weight = self.params.get('weight', shape=img_shape)

    def forward(self):
        return self.weight.data()


Next, we define the get_inits function. This function creates a composite image model instance 
and initializes it to the image X. The Gram matrix for the various style layers of the style image, 
styles_Y_gram, is computed prior to training. 


def get_inits(X, device, lr, styles_Y):
    gen_img = GeneratedImage(X.shape)
    gen_img.initialize(init.Constant(X), ctx=device, force_reinit=True)
    trainer = gluon.Trainer(gen_img.collect_params(), 'adam',
                            {'learning_rate': lr})
    styles_Y_gram = [gram(Y) for Y in styles_Y]
    return gen_img(), styles_Y_gram, trainer


13.12.7 Training


During model training, we constantly extract the content and style features of the composite
image and calculate the loss function. Recall our discussion of how synchronization functions force
the front end to wait for computation results in Section 12.2. Because we only call the asnumpy
synchronization function every 10 epochs, the process may occupy a great deal of memory.
Therefore, we call the waitall synchronization function during every epoch.


def train(X, contents_Y, styles_Y, device, lr, num_epochs, lr_decay_epoch):
    X, styles_Y_gram, trainer = get_inits(X, device, lr, styles_Y)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[10, num_epochs],
                            legend=['content', 'style', 'TV'],
                            ncols=2, figsize=(7, 2.5))
    for epoch in range(num_epochs):
        with autograd.record():
            contents_Y_hat, styles_Y_hat = extract_features(
                X, content_layers, style_layers)
            contents_l, styles_l, tv_l, l = compute_loss(
                X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram)
        l.backward()
        trainer.step(1)
        npx.waitall()
        if (epoch + 1) % lr_decay_epoch == 0:
            trainer.set_learning_rate(trainer.learning_rate * 0.1)
        if (epoch + 1) % 10 == 0:
            animator.axes[1].imshow(postprocess(X).asnumpy())
            animator.add(epoch + 1, [float(sum(contents_l)),
                                     float(sum(styles_l)), float(tv_l)])
    return X
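The structure of this training loop, in which the image itself is the only parameter and the learning rate is cut by a factor of 10 at fixed intervals, can be illustrated with a toy problem. The sketch below is not style transfer: it swaps in plain gradient descent for Adam and a simple squared error against a fixed target for the style transfer loss, and optimize_image is a hypothetical name.

```python
import numpy as np

def optimize_image(init, target, lr, num_epochs, lr_decay_epoch):
    """Gradient descent on the pixels of `x`, with step-wise learning-rate decay."""
    x = init.copy()
    losses = []
    for epoch in range(num_epochs):
        diff = x - target
        losses.append(0.5 * float(np.sum(diff ** 2)))
        grad = diff                  # gradient of 0.5 * sum((x - target) ** 2)
        x -= lr * grad               # the image itself is the parameter
        if (epoch + 1) % lr_decay_epoch == 0:
            lr *= 0.1                # same decay schedule as in train above
    return x, losses

rng = np.random.default_rng(0)
target = rng.random((3, 8, 8))
x, losses = optimize_image(np.zeros((3, 8, 8)), target, lr=0.5,
                           num_epochs=30, lr_decay_epoch=10)
```

In this toy setting the loss decreases at every step, and the decrease slows markedly after each learning-rate decay, which is the same qualitative pattern the schedule produces in the real training loop.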


Next, we start to train the model. First, we set the height and width of the content and style images 
to 150 by 225 pixels. We use the content image to initialize the composite image. 


device, image_shape = d2l.try_gpu(), (225, 150)
net.collect_params().reset_ctx(device)
content_X, contents_Y = get_contents(image_shape, device)
_, styles_Y = get_styles(image_shape, device)
output = train(content_X, contents_Y, styles_Y, device, 0.01, 500, 200)


[Plot: content, style, and TV loss curves; y-axis: loss.]
As you can see, the composite image retains the scenery and objects of the content image, while
introducing the color of the style image. Because the image is relatively small, the details are a bit
fuzzy.


To obtain a clearer composite image, we train the model using a larger image size: 900 x 600. We 
increase the height and width of the image used before by a factor of four and initialize a larger 
composite image. 


image_shape = (900, 600)
_, content_Y = get_contents(image_shape, device)
_, style_Y = get_styles(image_shape, device)
X = preprocess(postprocess(output) * 255, image_shape)
output = train(X, content_Y, style_Y, device, 0.01, 300, 100)
d2l.plt.imsave('../img/neural-style.jpg', postprocess(output).asnumpy())


[Plot: content and style loss curves; x-axis: epoch, y-axis: loss.]

As you can see, each epoch takes more time due to the larger image size. As shown in Fig. 13.12.3, 
the composite image produced retains more detail due to its larger size. The composite image not 
only has large blocks of color like the style image, but these blocks even have the subtle texture of 
brush strokes. 
Fig. 13.12.3: 900 × 600 composite image.


Summary 


* The loss functions used in style transfer generally have three parts:
  1. Content loss makes the composite image approximate the content image in terms of content
     features.
  2. Style loss makes the composite image approximate the style image in terms of style
     features.
  3. Total variation loss helps reduce the noise in the composite image.


* We can use a pre-trained CNN to extract image features and minimize the loss function to
continuously update the composite image.


* We use a Gram matrix to represent the style output by the style layers.
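As a rough illustration of the last point, the following pure-Python sketch computes a Gram matrix for a toy feature map that has already been flattened to one row per channel. The function name gram, the toy values, and the division by the number of elements are assumptions made for this example, not necessarily the exact implementation used earlier in the section.

```python
def gram(features):
    """Gram matrix of a feature map given as c rows (channels) of n values."""
    c, n = len(features), len(features[0])
    # Entry (i, j) is the inner product of channel i and channel j,
    # divided by the number of elements (an assumed normalization)
    return [[sum(a * b for a, b in zip(features[i], features[j])) / (c * n)
             for j in range(c)] for i in range(c)]

# Two channels, four spatial positions (toy values)
features = [[1.0, 2.0, 0.0, 1.0],
            [0.0, 1.0, 1.0, 2.0]]
G = gram(features)
print(G)
```

Entry (i, j) is large when channels i and j tend to be active at the same positions, so the matrix records which features co-occur while discarding where they occur.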


Exercises 


1. How does the output change when you select different content and style layers? 


2. Adjust the weight hyperparameters in the loss function. Does the output retain more content 
or have less noise? 


3. Use different content and style images. Can you create more interesting composite images? 


4. Can we apply style transfer to text? Hint: you may refer to the survey paper by Hu et al.
(Hu et al., 2020).


Discussions: https://discuss.d2l.ai/t/378


13.13 Image Classification (CIFAR-10) on Kaggle


So far, we have been using Gluon's data package to directly obtain image datasets in the tensor
format. In practice, however, image datasets often exist in the format of image files. In this section,
we will start with the original image files and organize, read, and convert the files to the tensor
format step by step.


We performed an experiment on the CIFAR-10 dataset in Section 13.1. This is an important dataset
in the computer vision field. Now, we will apply the knowledge from the previous sections
to participate in the Kaggle competition on CIFAR-10 image classification.
The competition's web address is


https://www.kaggle.com/c/cifar-10 


Fig. 13.13.1 shows the information on the competition's webpage. In order to submit the results, 
please register an account on the Kaggle website first. 







[Screenshot of the Kaggle competition page "CIFAR-10 - Object Recognition in Images", with the
tabs Overview, Data, Discussion, Leaderboard, and Rules. The description reads: CIFAR-10 is an
established computer-vision dataset used for object recognition. It is a subset of the 80 million
tiny images dataset and consists of 60,000 32x32 color images containing one of 10 object classes,
with 6000 images per class. It was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.]


Fig. 13.13.1: CIFAR-10 image classification competition webpage information. The dataset for the 
competition can be accessed by clicking the “Data” tab. 


First, import the packages or modules required for the competition. 


import collections
from d2l import mxnet as d2l
import math
from mxnet import gluon, init, npx
from mxnet.gluon import nn
import os
import pandas as pd
import shutil

npx.set_np()







13.13.1 Obtaining and Organizing the Dataset 


The competition data is divided into a training set and a testing set. The training set contains 50,000
images. The testing set contains 300,000 images, of which 10,000 images are used for scoring,
while the other 290,000 non-scoring images are included to prevent manual labeling of the
testing set and submission of the labeling results. The images in both datasets are PNG files
with heights and widths of 32 pixels and three color channels (RGB). The images cover 10
categories: planes, cars, birds, cats, deer, dogs, frogs, horses, boats, and trucks. The upper-left corner
of Fig. 13.13.1 shows some images of planes, cars, and birds in the dataset.


Downloading the Dataset 


After logging in to Kaggle, we can click on the “Data” tab on the CIFAR-10 image classification 
competition webpage shown in Fig. 13.13.1 and download the dataset by clicking the “Download 
All” button. After unzipping the downloaded file in ../data, and unzipping train.7z and test.7z 
inside it, you will find the entire dataset in the following paths: 


* ../data/cifar-10/train/[1-50000].png

* ../data/cifar-10/test/[1-300000].png

* ../data/cifar-10/trainLabels.csv

* ../data/cifar-10/sampleSubmission.csv


Here, the folders train and test contain the training and testing images respectively, trainLabels.csv
has the labels for the training images, and sampleSubmission.csv is a sample submission.


To make it easier to get started, we provide a small-scale sample of the dataset: it contains the first
1000 training images and 5 random testing images. To use the full dataset of the Kaggle
competition, you need to set the following demo variable to False.


#@save
d2l.DATA_HUB['cifar10_tiny'] = (d2l.DATA_URL + 'kaggle_cifar10_tiny.zip',
                                '2068874e4b9a9f0fb07ebe0ad2b29754449ccacd')

# If you use the full dataset downloaded for the Kaggle competition, set
# `demo` to False
demo = True

if demo:
    data_dir = d2l.download_extract('cifar10_tiny')
else:
    data_dir = '../data/cifar-10/'


Downloading ../data/kaggle_cifar10_tiny.zip from http://d2l-data.s3-accelerate.amazonaws.com/kaggle_cifar10_tiny.zip...


Organizing the Dataset


We need to organize datasets to facilitate model training and testing. Let us first read the labels 
from the csv file. The following function returns a dictionary that maps the filename without 
extension to its label. 


#@save
def read_csv_labels(fname):
    """Read fname to return a name to label dictionary."""
    with open(fname, 'r') as f:
        # Skip the file header line (column name)
        lines = f.readlines()[1:]
    tokens = [l.rstrip().split(',') for l in lines]
    return dict(((name, label) for name, label in tokens))

labels = read_csv_labels(os.path.join(data_dir, 'trainLabels.csv'))
print('# training examples:', len(labels))
print('# classes:', len(set(labels.values())))


# training examples: 1000 
# classes: 10 


Next, we define the reorg_train_valid function to split a validation set off from the original
training set. The argument valid_ratio in this function is the ratio of the number of examples
in the validation set to the number of examples in the original training set. In particular, let n
be the number of images of the class with the fewest examples, and r be the ratio; then we
use max(⌊nr⌋, 1) images from each class as the validation set. Let us use valid_ratio=0.1 as an
example. Since the original training set has 50,000 images, 45,000 images will be used for
training and stored in the path “train_valid_test/train” when tuning hyperparameters, while
the other 5,000 images will be stored as the validation set in the path “train_valid_test/valid”. After
organizing the data, images of the same class are placed under the same folder so that we can
read them later.
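The arithmetic in this paragraph can be checked with a small standalone sketch. The helper n_valid_per_class and the synthetic label lists are illustrative assumptions; they only mimic the class balance of the full training set (10 classes with 5,000 images each) and do not touch any files.

```python
import collections
import math

def n_valid_per_class(labels, valid_ratio):
    """Validation images taken from each class: max(floor(n * r), 1)."""
    # n is the size of the smallest class
    n = min(collections.Counter(labels).values())
    return max(1, math.floor(n * valid_ratio))

# Toy stand-in for the full training set: 10 classes with 5,000 labels each
full_labels = [f'class{i % 10}' for i in range(50000)]
per_class = n_valid_per_class(full_labels, 0.1)
num_valid = per_class * 10
print(per_class, num_valid, len(full_labels) - num_valid)

# With a very small class, max(..., 1) keeps at least one image per class
tiny_labels = ['cat'] * 3 + ['dog'] * 40
print(n_valid_per_class(tiny_labels, 0.1))
```

Taking the same number of images from every class keeps the validation set balanced even when the classes themselves are not.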


#@save
def copyfile(filename, target_dir):
    """Copy a file into a target directory."""
    os.makedirs(target_dir, exist_ok=True)
    shutil.copy(filename, target_dir)

#@save
def reorg_train_valid(data_dir, labels, valid_ratio):
    # The number of examples of the class with the least examples in the
    # training dataset
    n = collections.Counter(labels.values()).most_common()[-1][1]
    # The number of examples per class for the validation set
    n_valid_per_label = max(1, math.floor(n * valid_ratio))
    label_count = {}
    for train_file in os.listdir(os.path.join(data_dir, 'train')):
        label = labels[train_file.split('.')[0]]
        fname = os.path.join(data_dir, 'train', train_file)
        # Copy to train_valid_test/train_valid with a subfolder per class
        copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                     'train_valid', label))
        if label not in label_count or label_count[label] < n_valid_per_label:
            # Copy to train_valid_test/valid
            copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                         'valid', label))
            label_count[label] = label_count.get(label, 0) + 1
        else:
            # Copy to train_valid_test/train
            copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                         'train', label))
    return n_valid_per_label


The reorg_test function below is used to organize the testing set to facilitate the reading during 
prediction. 


#@save
def reorg_test(data_dir):
    for test_file in os.listdir(os.path.join(data_dir, 'test')):
        copyfile(os.path.join(data_dir, 'test', test_file),
                 os.path.join(data_dir, 'train_valid_test', 'test',
                              'unknown'))


Finally, we use a function to call the previously defined read_csv_labels, reorg_train_valid, and 
reorg_test functions. 


def reorg_cifar10_data(data_dir, valid_ratio):
    labels = read_csv_labels(os.path.join(data_dir, 'trainLabels.csv'))
    reorg_train_valid(data_dir, labels, valid_ratio)
    reorg_test(data_dir)


For the demo dataset, we set the batch size to only 4. During actual training and testing, the
complete dataset of the Kaggle competition should be used and batch_size should be set to a larger
integer, such as 128. We use 10% of the training examples as the validation set for tuning
hyperparameters.


batch_size = 4 if demo else 128 
valid_ratio = 0.1 
reorg_cifar10_data(data_dir, valid_ratio) 


13.13.2 Image Augmentation 


To cope with overfitting, we use image augmentation. For example, by adding
transforms.RandomFlipLeftRight(), the images can be flipped at random. We can also normalize
the three RGB channels of color images using transforms.Normalize(). Below, we list some
of these operations, which you can use or modify depending on your requirements.


transform_train = gluon.data.vision.transforms.Compose([
    # Magnify the image to a square of 40 pixels in both height and width
    gluon.data.vision.transforms.Resize(40),
    # Randomly crop a square region covering 0.64 to 1 times the area of
    # the 40 x 40 image, and then shrink it to a square of 32 pixels in
    # both height and width
    gluon.data.vision.transforms.RandomResizedCrop(32, scale=(0.64, 1.0),
                                                   ratio=(1.0, 1.0)),
    gluon.data.vision.transforms.RandomFlipLeftRight(),
    gluon.data.vision.transforms.ToTensor(),
    # Normalize each channel of the image
    gluon.data.vision.transforms.Normalize([0.4914, 0.4822, 0.4465],
                                           [0.2023, 0.1994, 0.2010])])
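The crop sizes implied by scale=(0.64, 1.0) can be checked with a little arithmetic. The sketch below assumes a square crop (aspect ratio 1, as set above) taken from the 40 x 40 resized image; square_crop_side is a hypothetical helper for this example, not part of Gluon.

```python
import math

def square_crop_side(side, area_fraction):
    """Side length of a square crop covering area_fraction of a side x side image."""
    return side * math.sqrt(area_fraction)

smallest = square_crop_side(40, 0.64)
largest = square_crop_side(40, 1.0)
print(round(smallest, 2), round(largest, 2))
```

Under these assumptions the crop side ranges from about 32 to 40 pixels, so the final resize to 32 pixels shrinks the crop by at most a factor of 0.8 per side.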


To keep the output deterministic during testing, we only perform normalization on
the image.


transform_test = gluon.data.vision.transforms.Compose([ 
gluon.data.vision.transforms.ToTensor(), 
gluon.data.vision.transforms.Normalize([0.4914, 0.4822, 0.4465], 
[0.2023, 0.1994, 0.2010])]) 
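For intuition, ToTensor followed by Normalize boils down to simple per-channel arithmetic on each pixel. The sketch below applies it to single RGB pixels in pure Python, assuming 8-bit inputs that are first scaled to the range 0 to 1; normalize_pixel is a hypothetical helper, and the means and standard deviations are the ones listed above.

```python
MEAN = [0.4914, 0.4822, 0.4465]
STD = [0.2023, 0.1994, 0.2010]

def normalize_pixel(rgb):
    """Scale 8-bit RGB values to [0, 1], then standardize each channel."""
    return [(value / 255 - mean) / std
            for value, mean, std in zip(rgb, MEAN, STD)]

black = normalize_pixel([0, 0, 0])
white = normalize_pixel([255, 255, 255])
print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
```

After this step, a pixel at the channel mean maps to 0, darker pixels become negative, and brighter pixels become positive.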


13.13.3 Reading the Dataset 


Next, we can create the ImageFolderDataset instance to read the organized dataset containing the 
original image files, where each example includes the image and label. 


train_ds, valid_ds, train_valid_ds, test_ds = [ 
gluon.data.vision.ImageFolderDataset( 
os.path.join(data_dir, 'train_valid_test', folder)) 
for folder in ['train', 'valid', 'train_valid', 'test']] 


We apply the image augmentation operations defined above through DataLoader. During training,
the validation set is used only to evaluate the model, so its preprocessing must be deterministic.
For the final prediction, we will train the model on the combined training set and validation set
to make full use of all labeled data.


train_iter, train_valid_iter = [gluon.data.DataLoader(
    dataset.transform_first(transform_train), batch_size, shuffle=True,
    last_batch='discard') for dataset in (train_ds, train_valid_ds)]

valid_iter = gluon.data.DataLoader(
    valid_ds.transform_first(transform_test), batch_size, shuffle=False,
    last_batch='discard')

test_iter = gluon.data.DataLoader(
    test_ds.transform_first(transform_test), batch_size, shuffle=False,
    last_batch='keep')


13.13.4 Defining the Model


Here, we build the residual blocks based on the HybridBlock class, which is slightly different than 
the implementation described in Section 7.6. This is done to improve execution efficiency. 


class Residual(nn.HybridBlock):
    def __init__(self, num_channels, use_1x1conv=False, strides=1, **kwargs):
        super(Residual, self).__init__(**kwargs)
        self.conv1 = nn.Conv2D(num_channels, kernel_size=3, padding=1,
                               strides=strides)
        self.conv2 = nn.Conv2D(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2D(num_channels, kernel_size=1,
                                   strides=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm()
        self.bn2 = nn.BatchNorm()

    def hybrid_forward(self, F, X):
        Y = F.npx.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        return F.npx.relu(Y + X)


Next, we define the ResNet-18 model. 


def resnet18(num_classes):
    net = nn.HybridSequential()
    net.add(nn.Conv2D(64, kernel_size=3, strides=1, padding=1),
            nn.BatchNorm(), nn.Activation('relu'))

    def resnet_block(num_channels, num_residuals, first_block=False):
        blk = nn.HybridSequential()
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
            else:
                blk.add(Residual(num_channels))
        return blk

    net.add(resnet_block(64, 2, first_block=True),
            resnet_block(128, 2),
            resnet_block(256, 2),
            resnet_block(512, 2))
    net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
    return net


The CIFAR-10 image classification challenge uses 10 categories. We will perform Xavier random 
initialization on the model before training begins. 


def get_net(devices):
    num_classes = 10
    net = resnet18(num_classes)
    net.initialize(ctx=devices, init=init.Xavier())
    return net

loss = gluon.loss.SoftmaxCrossEntropyLoss()


13.13.5 Defining the Training Functions 


We will select the model and tune hyperparameters according to the model’s performance on the 
validation set. Next, we define the model training function train. We record the training time of 
each epoch, which helps us compare the time costs of different models. 


def train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
          lr_decay):
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': lr, 'momentum': 0.9, 'wd': wd})
    num_batches, timer = len(train_iter), d2l.Timer()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'valid acc'])
    for epoch in range(num_epochs):
        metric = d2l.Accumulator(3)
        if epoch > 0 and epoch % lr_period == 0:
            trainer.set_learning_rate(trainer.learning_rate * lr_decay)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            l, acc = d2l.train_batch_ch13(
                net, features, labels.astype('float32'), loss, trainer,
                devices, d2l.split_batch)
            metric.add(l, acc, labels.shape[0])
            timer.stop()
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[2], metric[1] / metric[2],
                              None))
        if valid_iter is not None:
            valid_acc = d2l.evaluate_accuracy_gpus(net, valid_iter,
                                                   d2l.split_batch)
            animator.add(epoch + 1, (None, None, valid_acc))
    if valid_iter is not None:
        print(f'loss {metric[0] / metric[2]:.3f}, '
              f'train acc {metric[1] / metric[2]:.3f}, '
              f'valid acc {valid_acc:.3f}')
    else:
        print(f'loss {metric[0] / metric[2]:.3f}, '
              f'train acc {metric[1] / metric[2]:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(devices)}')


13.13.6 Training and Validating the Model


Now, we can train and validate the model. The following hyperparameters can be tuned. For
example, we can increase the number of epochs. Because lr_period and lr_decay are set to 50
and 0.1 respectively, the learning rate of the optimization algorithm will be multiplied by 0.1 after
every 50 epochs. For simplicity, we only train for a few epochs here (num_epochs is set to 5), so
the learning rate is never actually decayed in this run.
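The step schedule described here can be written in closed form. The sketch below mirrors the rule in the train function (multiply by lr_decay whenever a zero-based epoch index greater than 0 is divisible by lr_period); lr_at_epoch is a hypothetical helper for illustration and does not use the Gluon trainer.

```python
def lr_at_epoch(base_lr, epoch, lr_period, lr_decay):
    """Learning rate in effect during a zero-based epoch under step decay."""
    return base_lr * lr_decay ** (epoch // lr_period)

for epoch in (0, 49, 50, 99, 100):
    print(epoch, lr_at_epoch(0.1, epoch, lr_period=50, lr_decay=0.1))
```

With a base learning rate of 0.1, the rate stays at 0.1 for epochs 0 through 49, drops to roughly 0.01 for epochs 50 through 99, and to roughly 0.001 from epoch 100 on.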


devices, num_epochs, lr, wd = d2l.try_all_gpus(), 5, 0.1, 5e-4
lr_period, lr_decay, net = 50, 0.1, get_net(devices)
net.hybridize()
train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
      lr_decay)


loss 2.294, train acc 0.148, valid acc 0.100 
136.0 examples/sec on [gpu(0), gpu(1)] 


[Figure: train loss, train acc, and valid acc plotted against epoch.]


13.13.7 Classifying the Testing Set and Submitting Results on Kaggle 


After obtaining a satisfactory model design and hyperparameters, we use all the labeled data
(the training set together with the validation set) to retrain the model and classify the testing set.


net, preds = get_net(devices), []
net.hybridize()
train(net, train_valid_iter, None, num_epochs, lr, wd, devices, lr_period,
      lr_decay)

for X, _ in test_iter:
    y_hat = net(X.as_in_ctx(devices[0]))
    preds.extend(y_hat.argmax(axis=1).astype(int).asnumpy())
sorted_ids = list(range(1, len(test_ds) + 1))
sorted_ids.sort(key=lambda x: str(x))
df = pd.DataFrame({'id': sorted_ids, 'label': preds})
df['label'] = df['label'].apply(lambda x: train_valid_ds.synsets[x])
df.to_csv('submission.csv', index=False)


loss nan, train acc 0.102
131.2 examples/sec on [gpu(0), gpu(1)]


[Figure: train loss, train acc, and valid acc plotted against epoch.]


After executing the above code, we will get a “submission.csv” file. The format of this file is
consistent with the Kaggle competition requirements. The method for submitting results is similar
to the method in Section 4.10.
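One detail of the prediction code is easy to miss: the IDs are sorted as strings, not as numbers. The sketch below shows why this can matter, under the assumption that the test images are read in lexicographic file-name order (which is how a sorted directory listing behaves); the twelve IDs are a toy stand-in for the real testing set.

```python
ids = list(range(1, 13))  # toy stand-in for the test image IDs

numeric_order = sorted(ids)
string_order = sorted(ids, key=str)
# File names sorted the way a sorted directory listing would order them
file_order = sorted(f'{i}.png' for i in ids)

print(numeric_order)
print(string_order)
print(file_order)
```

Sorting the integer IDs by their string form reproduces the file-name order (1, 10, 11, 12, 2, ...), so each prediction is paired with the ID of the image it was computed from.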


Summary


* We can create an ImageFolderDataset instance to read the dataset containing the original
image files.


* We can use convolutional neural networks, image augmentation, and hybrid programming
to take part in an image classification competition.


Exercises 


1. Use the complete CIFAR-10 dataset for the Kaggle competition. Change the batch_size and 
number of epochs num_epochs to 128 and 100, respectively. See what accuracy and ranking 
you can achieve in this competition. 


2. What accuracy can you achieve when not using image augmentation? 


3. Scan the QR code to access the relevant discussions and exchange ideas about the methods
used and the results obtained with the community. Can you come up with any better
techniques?


Discussions: https://discuss.d2l.ai/t/379


13.14 Dog Breed Identification (ImageNet Dogs) on Kaggle


In this section, we will tackle the dog breed identification challenge on Kaggle.
The competition's web address is


https://www.kaggle.com/c/dog-breed-identification


In this competition, we attempt to identify 120 different breeds of dogs. The dataset used in this
competition is actually a subset of the famous ImageNet dataset. Unlike the images in the
CIFAR-10 dataset used in the previous section, the images in the ImageNet dataset are taller and
wider, and their dimensions are inconsistent.


Fig. 13.14.1 shows the information on the competition's webpage. In order to submit the results, 
please register an account on the Kaggle website first. 


[Screenshot of the Kaggle playground prediction competition page "Dog Breed Identification:
Determine the breed of a dog in an image", with the tabs Overview, Data, Kernels, Discussion,
Leaderboard, and Rules. The description reads: Who's a good dog? Who likes ear scratches? Well,
it seems those fancy deep neural networks don't have all the answers. However, maybe they can
answer that ubiquitous question we all ask when meeting a four-legged stranger: what kind of
good pup is that? In this playground competition, you are provided a strictly canine subset of
ImageNet in order to practice fine-grained image categorization. How well you can tell your
Norfolk Terriers from your Norwich Terriers? With 120 breeds of dogs and a limited number
training images per class, you might find the problem more, err, ruff than you anticipated.]





Fig. 13.14.1: Dog breed identification competition website. The dataset for the competition can 
be accessed by clicking the “Data” tab. 


First, import the packages or modules required for the competition.


from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, npx
from mxnet.gluon import nn
import os

npx.set_np()


13.14.1 Obtaining and Organizing the Dataset


The competition data is divided into a training set and a testing set. The training set contains 10,222
images and the testing set contains 10,357 images. The images in both sets are in JPEG format.
These images contain three RGB channels (color) and they have different heights and widths.
There are 120 breeds of dogs in the training set, including Labradors, Poodles, Dachshunds,
Samoyeds, Huskies, Chihuahuas, and Yorkshire Terriers.


Downloading the Dataset 


After logging in to Kaggle, we can click on the “Data” tab on the dog breed identification
competition webpage shown in Fig. 13.14.1 and download the dataset by clicking the “Download All”
button. After unzipping the downloaded file in ../data, you will find the entire dataset in the
following paths:


* ../data/dog-breed-identification/labels.csv

* ../data/dog-breed-identification/sample_submission.csv

* ../data/dog-breed-identification/train

* ../data/dog-breed-identification/test


You may have noticed that the above structure is quite similar to that of the CIFAR-10 competition
in Section 13.13, where folders train/ and test/ contain training and testing dog images
respectively, and labels.csv has the labels for the training images.


Similarly, to make it easier to get started, we provide a small-scale sample of the dataset
mentioned above, “kaggle_dog_tiny.zip”. If you are going to use the full dataset for the Kaggle
competition, you will also need to change the demo variable below to False.


#@save
d2l.DATA_HUB['dog_tiny'] = (d2l.DATA_URL + 'kaggle_dog_tiny.zip',
                            '0cb91d09b814ecdc07b50f31f8dcad3e81d6a86d')

# If you use the full dataset downloaded for the Kaggle competition, change
# the variable below to False
demo = True

if demo:
    data_dir = d2l.download_extract('dog_tiny')
else:
    data_dir = os.path.join('..', 'data', 'dog-breed-identification')


Downloading ../data/kaggle_dog_tiny.zip from http://d2l-data.s3-accelerate.amazonaws.com/kaggle_dog_tiny.zip...


Organizing the Dataset


We can organize the dataset similarly to what we did in Section 13.13, namely separating a
validation set from the training set, and moving images into subfolders grouped by labels.


The reorg_dog_data function below is used to read the training data labels, segment the validation 
set, and organize the training set. 


def reorg_dog_data(data_dir, valid_ratio):
    labels = d2l.read_csv_labels(os.path.join(data_dir, 'labels.csv'))
    d2l.reorg_train_valid(data_dir, labels, valid_ratio)
    d2l.reorg_test(data_dir)

batch_size = 4 if demo else 128
valid_ratio = 0.1
reorg_dog_data(data_dir, valid_ratio)


13.14.2 Image Augmentation 


The images in this section are larger than the images in the previous section. Here are
some more image augmentation operations that might be useful.


transform_train = gluon.data.vision.transforms.Compose([
    # Randomly crop the image to obtain an image with an area of 0.08 to 1 of
    # the original area and height to width ratio between 3/4 and 4/3. Then,
    # scale the image to create a new image with a height and width of 224
    # pixels each
    gluon.data.vision.transforms.RandomResizedCrop(224, scale=(0.08, 1.0),
                                                   ratio=(3.0/4.0, 4.0/3.0)),
    gluon.data.vision.transforms.RandomFlipLeftRight(),
    # Randomly change the brightness, contrast, and saturation
    gluon.data.vision.transforms.RandomColorJitter(brightness=0.4,
                                                   contrast=0.4,
                                                   saturation=0.4),
    # Add random noise
    gluon.data.vision.transforms.RandomLighting(0.1),
    gluon.data.vision.transforms.ToTensor(),
    # Standardize each channel of the image
    gluon.data.vision.transforms.Normalize([0.485, 0.456, 0.406],
                                           [0.229, 0.224, 0.225])])


During testing, we only use deterministic image preprocessing operations.


transform_test = gluon.data.vision.transforms.Compose([
    gluon.data.vision.transforms.Resize(256),
    # Crop a square of 224 by 224 from the center of the image
    gluon.data.vision.transforms.CenterCrop(224),
    gluon.data.vision.transforms.ToTensor(),
    gluon.data.vision.transforms.Normalize([0.485, 0.456, 0.406],
                                           [0.229, 0.224, 0.225])])
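The geometry of a center crop is simple enough to compute by hand. The following sketch returns the pixel box of a centered square crop; center_crop_box is a hypothetical helper, and the integer rounding it uses is an assumption that may differ slightly from what the library does internally.

```python
def center_crop_box(width, height, crop):
    """Return (left, top, right, bottom) of a centered crop x crop square."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

box = center_crop_box(256, 256, 224)
print(box)
```

For a 256 x 256 input and a 224 x 224 crop, a 16-pixel border is discarded on every side, so the crop always covers the same central region and the test-time pipeline involves no randomness.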


13.14.3 Reading the Dataset


As in the previous section, we can create an ImageFolderDataset instance to read the dataset
containing the original image files.


train_ds, valid_ds, train_valid_ds, test_ds = [ 
gluon.data.vision.ImageFolderDataset( 
os.path.join(data_dir, 'train_valid_test', folder)) 
for folder in ('train', 'valid', 'train_valid', 'test')] 


Here, we create DataLoader instances, just like in Section 13.13. 


train_iter, train_valid_iter = [gluon.data.DataLoader(
    dataset.transform_first(transform_train), batch_size, shuffle=True,
    last_batch='discard') for dataset in (train_ds, train_valid_ds)]

valid_iter = gluon.data.DataLoader(
    valid_ds.transform_first(transform_test), batch_size, shuffle=False,
    last_batch='discard')

test_iter = gluon.data.DataLoader(
    test_ds.transform_first(transform_test), batch_size, shuffle=False,
    last_batch='keep')


13.14.4 Defining the Model 


The dataset for this competition is a subset of the ImageNet dataset. Therefore, we can use the
approach discussed in Section 13.2 to select a model pre-trained on the entire ImageNet dataset
and use it to extract image features to be fed into a small custom output network. Gluon
provides a wide range of pre-trained models. Here, we will use the pre-trained ResNet-34 model.
Because the competition dataset is a subset of the pre-training dataset, we simply reuse the
input of the pre-trained model's output layer, i.e., the extracted features. Then, we can replace the
original output layer with a small custom output network that can be trained, such as two fully
connected layers in series. Unlike the experiment in Section 13.2, here we do not retrain
the pre-trained model used for feature extraction. This reduces the training time and the
memory required to store model parameter gradients.
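To get a feel for how small the trainable part is, the sketch below counts the parameters of an output network with one 256-unit hidden layer and 120 outputs, the sizes used in this section. It assumes the extracted feature vector has 512 elements, as in the standard ResNet-34 design; dense_params is a hypothetical helper for this example.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

feature_dim = 512  # assumed size of the extracted feature vector
head = dense_params(feature_dim, 256) + dense_params(256, 120)
print(head)
```

Only these parameters need gradients during training; the pre-trained layers, which hold by far most of the model's parameters, stay fixed.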


Note that, during image augmentation, we use the mean values and standard deviations
of the three RGB channels for the entire ImageNet dataset for normalization. This is consistent
with the normalization of the pre-trained model.


def get_net(devices):
    finetune_net = gluon.model_zoo.vision.resnet34_v2(pretrained=True)
    # Define a new output network
    finetune_net.output_new = nn.HybridSequential(prefix='')
    finetune_net.output_new.add(nn.Dense(256, activation='relu'))
    # There are 120 output categories
    finetune_net.output_new.add(nn.Dense(120))
    # Initialize the output network
    finetune_net.output_new.initialize(init.Xavier(), ctx=devices)
    # Distribute the model parameters to the CPUs or GPUs used for computation
    finetune_net.collect_params().reset_ctx(devices)
    return finetune_net


When calculating the loss, we first use the member variable features to obtain the input of the
pre-trained model's output layer, i.e., the extracted features. Then, we use these features as the input
for our small custom output network and compute the output.


loss = gluon.loss.SoftmaxCrossEntropyLoss() 


def evaluate_loss(data_iter, net, devices):
    l_sum, n = 0.0, 0
    for features, labels in data_iter:
        X_shards, y_shards = d2l.split_batch(features, labels, devices)
        output_features = [net.features(X_shard) for X_shard in X_shards]
        outputs = [net.output_new(feature) for feature in output_features]
        ls = [loss(output, y_shard).sum() for output, y_shard
              in zip(outputs, y_shards)]
        l_sum += sum([float(l.sum()) for l in ls])
        n += labels.size
    return l_sum / n


13.14.5 Defining the Training Functions 


We will select the model and tune hyperparameters according to the model's performance on the 
validation set. The model training function train only trains the small custom output network. 


def train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
          lr_decay):
    # Only train the small custom output network
    trainer = gluon.Trainer(net.output_new.collect_params(), 'sgd',
                            {'learning_rate': lr, 'momentum': 0.9, 'wd': wd})
    num_batches, timer = len(train_iter), d2l.Timer()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'valid loss'])
    for epoch in range(num_epochs):
        metric = d2l.Accumulator(2)
        if epoch > 0 and epoch % lr_period == 0:
            trainer.set_learning_rate(trainer.learning_rate * lr_decay)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            X_shards, y_shards = d2l.split_batch(features, labels, devices)
            output_features = [net.features(X_shard) for X_shard in X_shards]
            with autograd.record():
                outputs = [net.output_new(feature)
                           for feature in output_features]
                ls = [loss(output, y_shard).sum() for output, y_shard
                      in zip(outputs, y_shards)]
            for l in ls:
                l.backward()
            trainer.step(batch_size)
            metric.add(sum([float(l.sum()) for l in ls]), labels.shape[0])
            timer.stop()
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[1], None))
        if valid_iter is not None:
            valid_loss = evaluate_loss(valid_iter, net, devices)
            animator.add(epoch + 1, (None, valid_loss))
    if valid_iter is not None:
        print(f'train loss {metric[0] / metric[1]:.3f}, '
              f'valid loss {valid_loss:.3f}')
    else:
        print(f'train loss {metric[0] / metric[1]:.3f}')
    print(f'{metric[1] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(devices)}')


13.14.6 Training and Validating the Model 


Now, we can train and validate the model. The following hyperparameters can be tuned. For
example, we can increase the number of epochs. Because lr_period and lr_decay are set to 10
and 0.1 respectively, the learning rate of the optimization algorithm will be multiplied by 0.1 after
every 10 epochs.


devices, num_epochs, lr, wd = d2l.try_all_gpus(), 5, 0.01, 1e-4
lr_period, lr_decay, net = 10, 0.1, get_net(devices)
net.hybridize()
train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
      lr_decay)


train loss 2.371, valid loss 2.556
220.4 examples/sec on [gpu(0), gpu(1)]


[Figure: train loss and valid loss plotted against epoch.]


13.14.7 Classifying the Testing Set and Submitting Results on Kaggle


After obtaining a satisfactory model design and hyperparameters, we use all the labeled data
(the training set together with the validation set) to retrain the model and then classify the
testing set. Note that predictions are made by the output network we just trained.


net = get_net(devices)
net.hybridize()
train(net, train_valid_iter, None, num_epochs, lr, wd, devices, lr_period,
      lr_decay)

preds = []
for data, label in test_iter:
    output_features = net.features(data.as_in_ctx(devices[0]))
    output = npx.softmax(net.output_new(output_features))
    preds.extend(output.asnumpy())
ids = sorted(os.listdir(
    os.path.join(data_dir, 'train_valid_test', 'test', 'unknown')))
with open('submission.csv', 'w') as f:
    f.write('id,' + ','.join(train_valid_ds.synsets) + '\n')
    for i, output in zip(ids, preds):
        f.write(i.split('.')[0] + ',' + ','.join(
            [str(num) for num in output]) + '\n')


train loss 2.396
213.4 examples/sec on [gpu(0), gpu(1)]


[Figure: train loss and valid loss plotted against epoch.]


After executing the above code, we will generate a “submission.csv” file. The format of this file
is consistent with the Kaggle competition requirements. The method for submitting results is
similar to the method in Section 4.10.
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Summary 


• We can use a model pre-trained on the ImageNet dataset to extract features and only train a
small custom output network. This will allow us to classify a subset of the ImageNet dataset
with lower computing and storage overhead.


Exercises 
1. When using the entire Kaggle dataset, what kind of results do you get when you increase the 
batch_size (batch size) and num_epochs (number of epochs)? 
2. Do you get better results if you use a deeper pre-trained model? 


3. Scan the QR code to access the relevant discussions and exchange ideas about the meth- 
ods used and the results obtained with the community. Can you come up with any better 
techniques? 


Discussions [193]


193 https://discuss.d2l.ai/t/380

14 Natural Language Processing: Pretraining


Humans need to communicate. Out of this basic need of the human condition, a vast amount of 
written text has been generated on an everyday basis. Given rich text in social media, chat apps, 
emails, product reviews, news articles, research papers, and books, it becomes vital to enable 
computers to understand them to offer assistance or make decisions based on human languages. 


Natural language processing studies interactions between computers and humans using natural 
languages. In practice, it is very common to use natural language processing techniques to pro- 
cess and analyze text (human natural language) data, such as language models in Section 8.3 and 
machine translation models in Section 9.5. 


To understand text, we can begin with its representation, such as treating each word or subword
as an individual text token. As we will see in this chapter, the representation of each token can
be pretrained on a large corpus, using word2vec, GloVe, or subword embedding models. After
pretraining, the representation of each token is a vector. However, it remains the same no matter
what the context is. For instance, the vector representation of “bank” is the same in both “go
to the bank to deposit some money” and “go to the bank to sit down”. Thus, many more recent
pretraining models adapt the representation of the same token to different contexts. Among them is
BERT, a much deeper model based on the Transformer encoder. In this chapter, we will focus on
how to pretrain such representations for text, as highlighted in Fig. 14.1.




Fig. 14.1: Pretrained text representations can be fed to various deep learning architectures for
different downstream natural language processing applications. This chapter focuses on the
upstream text representation pretraining.

As shown in Fig. 14.1, the pretrained text representations can be fed to a variety of deep learning
architectures for different downstream natural language processing applications. We will cover
them in Chapter 15.


14.1 Word Embedding (word2vec) 


A natural language is a complex system that we use to express meanings. In this system, words 
are the basic unit of linguistic meaning. As its name implies, a word vector is a vector used to 
represent a word. It can also be thought of as the feature vector of a word. The technique of 
mapping words to vectors of real numbers is also known as word embedding. Over the last few 
years, word embedding has gradually become basic knowledge in natural language processing. 


14.1.1 Why Not Use One-hot Vectors? 


We used one-hot vectors to represent words (characters are words) in Section 8.5. Recall that
when we assume the number of different words in a dictionary (the dictionary size) is $N$, each
word can correspond one-to-one with consecutive integers from 0 to $N - 1$. These integers that
correspond to words are called the indices of the words. We assume that the index of a word is $i$.
In order to get the one-hot vector representation of the word, we create a vector of all 0s with a
length of $N$ and set element $i$ to 1. In this way, each word is represented as a vector of length $N$
that can be used directly by the neural network.


Although one-hot word vectors are easy to construct, they are usually not a good choice. One of the
major reasons is that the one-hot word vectors cannot accurately express the similarity between
different words, such as the cosine similarity that we commonly use. For the vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$,
their cosine similarity is the cosine of the angle between them:

$$\frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} \in [-1, 1]. \tag{14.1.1}$$

Since the cosine similarity between the one-hot vectors of any two different words is 0, it is difficult
to use the one-hot vector to accurately represent the similarity between multiple different words.
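
To make the point concrete, the following minimal sketch builds one-hot vectors in plain Python and evaluates their cosine similarity. The helper names one_hot and cosine_similarity and the dictionary size N = 5 are chosen only for illustration.

```python
import math


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


N = 5  # illustrative dictionary size
v1, v3 = one_hot(1, N), one_hot(3, N)
print(cosine_similarity(v1, v3))  # two different words
print(cosine_similarity(v1, v1))  # a word with itself
```

Whatever two distinct indices are chosen, their one-hot vectors share no non-zero position, so the inner product in the numerator vanishes.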


Word2vec [194] is a tool that was proposed to solve the problem above. It represents each word
with a fixed-length vector and uses these vectors to better indicate the similarity and analogy
relationships between different words. The Word2vec tool contains two models: skip-gram (Mikolov
et al., 2013b) and continuous bag of words (CBOW) (Mikolov et al., 2013a). Next, we will take a
look at the two models and their training methods.


14.1.2 The Skip-Gram Model 


The skip-gram model assumes that a word can be used to generate the words that surround it in a
text sequence. For example, we assume that the text sequence is “the”, “man”, “loves”, “his”, and
“son”. We use “loves” as the central target word and set the context window size to 2. As shown
in Fig. 14.1.1, given the central target word “loves”, the skip-gram model is concerned with the
conditional probability for generating the context words, “the”, “man”, “his” and “son”, that are
within a distance of no more than 2 words, which is

$$P(\textrm{"the"}, \textrm{"man"}, \textrm{"his"}, \textrm{"son"} \mid \textrm{"loves"}). \tag{14.1.2}$$





194 https://code.google.com/archive/p/word2vec/

We assume that, given the central target word, the context words are generated independently of
each other. In this case, the formula above can be rewritten as

$$P(\textrm{"the"} \mid \textrm{"loves"}) \cdot P(\textrm{"man"} \mid \textrm{"loves"}) \cdot P(\textrm{"his"} \mid \textrm{"loves"}) \cdot P(\textrm{"son"} \mid \textrm{"loves"}). \tag{14.1.3}$$




Fig. 14.1.1: The skip-gram model cares about the conditional probability of generating context 
words for a given central target word. 


In the skip-gram model, each word is represented as two $d$-dimensional vectors, which are used to
compute the conditional probability. We assume that the word is indexed as $i$ in the dictionary, its
vector is represented as $\mathbf{v}_i \in \mathbb{R}^d$ when it is the central target word, and $\mathbf{u}_i \in \mathbb{R}^d$ when it is a context
word. Let the central target word $w_c$ and context word $w_o$ be indexed as $c$ and $o$ respectively in the
dictionary. The conditional probability of generating the context word for the given central target
word can be obtained by performing a softmax operation on the vector inner product:

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}, \tag{14.1.4}$$

where the vocabulary index set is $\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}|-1\}$. Assume that a text sequence of length $T$ is given,
where the word at time step $t$ is denoted as $w^{(t)}$. Assume that context words are independently
generated given center words. When the context window size is $m$, the likelihood function of the
skip-gram model is the joint probability of generating all the context words given any center word:

$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)}). \tag{14.1.5}$$

Here, any time step that is less than 1 or greater than $T$ can be ignored.
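
As a rough illustration of the softmax in (14.1.4), the sketch below computes $P(w_o \mid w_c)$ for a toy vocabulary in plain Python. The vectors, the vocabulary size, and the function name skip_gram_prob are hypothetical values chosen for the example, not part of the book's code.

```python
import math


def skip_gram_prob(o, c, U, V):
    """P(w_o | w_c): softmax over inner products of context vectors with v_c."""
    scores = [sum(a * b for a, b in zip(u, V[c])) for u in U]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[o] / sum(exps)


# Hypothetical 3-word vocabulary with 2-dimensional vectors
V = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4]]   # center vectors v_i
U = [[0.3, 0.1], [-0.2, 0.6], [0.4, 0.4]]   # context vectors u_i

# Distribution over context words given center word with index c = 2
probs = [skip_gram_prob(o, 2, U, V) for o in range(len(U))]
print(probs, sum(probs))
```

Note that every call normalizes over all context vectors in the vocabulary, which is the cost that the approximate training methods of the next section aim to avoid.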


Skip-Gram Model Training 


The skip-gram model parameters are the central target word vector and context word vector for
each individual word. In the training process, we are going to learn the model parameters by
maximizing the likelihood function, which is also known as maximum likelihood estimation. This
is equivalent to minimizing the following loss function:

$$-\sum_{t=1}^{T} \sum_{-m \leq j \leq m,\ j \neq 0} \log P(w^{(t+j)} \mid w^{(t)}). \tag{14.1.6}$$


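If we use stochastic gradient descent (SGD), in each iteration we pick a shorter subsequence through random
sampling to compute the loss for that subsequence, and then compute the gradient to update the
model parameters. The key to gradient computation is to compute the gradient of the logarithmic
conditional probability with respect to the central word vector and the context word vector. By definition, we
first have

$$\log P(w_o \mid w_c) = \mathbf{u}_o^\top \mathbf{v}_c - \log\left(\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)\right). \tag{14.1.7}$$

Through differentiation, we can get the gradient with respect to $\mathbf{v}_c$ from the formula above:

$$\begin{aligned}
\frac{\partial \log P(w_o \mid w_c)}{\partial \mathbf{v}_c}
&= \mathbf{u}_o - \frac{\sum_{j \in \mathcal{V}} \exp(\mathbf{u}_j^\top \mathbf{v}_c)\mathbf{u}_j}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)} \\
&= \mathbf{u}_o - \sum_{j \in \mathcal{V}} \left(\frac{\exp(\mathbf{u}_j^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\right) \mathbf{u}_j \\
&= \mathbf{u}_o - \sum_{j \in \mathcal{V}} P(w_j \mid w_c) \mathbf{u}_j.
\end{aligned} \tag{14.1.8}$$

Its computation obtains the conditional probability for all the words in the dictionary given the
central target word $w_c$. We then use the same method to obtain the gradients for other word vectors.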
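
One way to gain confidence in (14.1.8) is to compare it with a finite-difference estimate of the gradient of (14.1.7). The sketch below does this in plain Python for a hypothetical 3-word vocabulary. All names and numbers are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_prob(o, v_c, U):
    """log P(w_o | w_c) as in (14.1.7), using a stable log-sum-exp."""
    scores = [dot(u, v_c) for u in U]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[o] - log_norm


def analytic_grad(o, v_c, U):
    """Gradient with respect to v_c from (14.1.8): u_o - sum_j P(w_j | w_c) u_j."""
    probs = [math.exp(log_prob(j, v_c, U)) for j in range(len(U))]
    return [U[o][k] - sum(p * u[k] for p, u in zip(probs, U))
            for k in range(len(v_c))]


def numeric_grad(o, v_c, U, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for k in range(len(v_c)):
        plus, minus = list(v_c), list(v_c)
        plus[k] += eps
        minus[k] -= eps
        grad.append((log_prob(o, plus, U) - log_prob(o, minus, U)) / (2 * eps))
    return grad


# Hypothetical context vectors and one center vector
U = [[0.3, 0.1], [-0.2, 0.6], [0.4, 0.4]]
v_c = [0.5, -0.4]
analytic = analytic_grad(1, v_c, U)
numeric = numeric_grad(1, v_c, U)
print(analytic)
print(numeric)
```

The two printed vectors should agree up to the finite-difference error.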


After the training, for any word in the dictionary with index $i$, we are going to get its two word
vectors $\mathbf{v}_i$ and $\mathbf{u}_i$. In applications of natural language processing, the central target word vector
in the skip-gram model is generally used as the representation vector of a word.


14.1.3 The Continuous Bag of Words (CBOW) Model 


The continuous bag of words (CBOW) model is similar to the skip-gram model. The biggest
difference is that the CBOW model assumes that the central target word is generated based on the
context words before and after it in the text sequence. With the same text sequence “the”, “man”,
“loves”, “his” and “son”, in which “loves” is the central target word, given a context window size
of 2, the CBOW model is concerned with the conditional probability of generating the target word
“loves” based on the context words “the”, “man”, “his” and “son” (as shown in Fig. 14.1.2), which is

$$P(\textrm{"loves"} \mid \textrm{"the"}, \textrm{"man"}, \textrm{"his"}, \textrm{"son"}). \tag{14.1.9}$$




Fig. 14.1.2: The CBOW model cares about the conditional probability of generating the central 
target word from given context words. 


Since there are multiple context words in the CBOW model, we will average their word vectors
and then use the same method as the skip-gram model to compute the conditional probability.
We assume that $\mathbf{v}_i \in \mathbb{R}^d$ and $\mathbf{u}_i \in \mathbb{R}^d$ are the context word vector and central target word vector
of the word with index $i$ in the dictionary (notice that the symbols are opposite to the ones in the
skip-gram model). Let the central target word $w_c$ be indexed as $c$, and the context words $w_{o_1}, \ldots, w_{o_{2m}}$ be
indexed as $o_1, \ldots, o_{2m}$ in the dictionary. Thus, the conditional probability of generating a central
target word from the given context words is

$$P(w_c \mid w_{o_1}, \ldots, w_{o_{2m}}) = \frac{\exp\left(\frac{1}{2m}\mathbf{u}_c^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}})\right)}{\sum_{i \in \mathcal{V}} \exp\left(\frac{1}{2m}\mathbf{u}_i^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}})\right)}. \tag{14.1.10}$$

For brevity, denote $\mathcal{W}_o = \{w_{o_1}, \ldots, w_{o_{2m}}\}$ and $\bar{\mathbf{v}}_o = (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}})/(2m)$. The equation above
can be simplified as

$$P(w_c \mid \mathcal{W}_o) = \frac{\exp(\mathbf{u}_c^\top \bar{\mathbf{v}}_o)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \bar{\mathbf{v}}_o)}. \tag{14.1.11}$$
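
The averaging step in (14.1.11) can be sketched as follows. The five-word vocabulary mirrors the running example ("the", "man", "loves", "his", "son"), but the vectors themselves and the function name cbow_prob are made up for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cbow_prob(c, context_ids, U, V):
    """P(w_c | W_o): softmax of center vectors against the averaged context vector."""
    d = len(V[0])
    # Average the context word vectors (the vector v-bar in the text)
    v_bar = [sum(V[o][k] for o in context_ids) / len(context_ids)
             for k in range(d)]
    scores = [dot(u, v_bar) for u in U]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[c] / sum(exps)


# Hypothetical vectors; indices 0..4 stand for "the", "man", "loves", "his", "son"
V = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4], [-0.1, 0.3], [0.2, 0.2]]  # context vectors
U = [[0.3, 0.1], [-0.2, 0.6], [0.4, 0.4], [0.2, -0.5], [0.0, 0.1]]  # center vectors
context_ids = [0, 1, 3, 4]  # window of size 2 around "loves"
probs = [cbow_prob(c, context_ids, U, V) for c in range(len(U))]
print(probs[2])   # probability assigned to "loves"
print(sum(probs))
```

Because the context vectors are averaged before the softmax, the cost of one prediction is still dominated by the sum over the whole dictionary.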





Given a text sequence of length $T$, we assume that the word at time step $t$ is $w^{(t)}$, and the context
window size is $m$. The likelihood function of the CBOW model is the probability of generating any
central target word from the context words:

$$\prod_{t=1}^{T} P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}). \tag{14.1.12}$$


CBOW Model Training 


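CBOW model training is quite similar to skip-gram model training. The maximum likelihood
estimation of the CBOW model is equivalent to minimizing the loss function

$$-\sum_{t=1}^{T} \log P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}). \tag{14.1.13}$$

Notice that

$$\log P(w_c \mid \mathcal{W}_o) = \mathbf{u}_c^\top \bar{\mathbf{v}}_o - \log\left(\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \bar{\mathbf{v}}_o)\right). \tag{14.1.14}$$

Through differentiation, we can compute the gradient of the logarithm of the conditional probability
with respect to any context word vector $\mathbf{v}_{o_i}$ ($i = 1, \ldots, 2m$) in the formula above:

$$\frac{\partial \log P(w_c \mid \mathcal{W}_o)}{\partial \mathbf{v}_{o_i}} = \frac{1}{2m}\left(\mathbf{u}_c - \sum_{j \in \mathcal{V}} \frac{\exp(\mathbf{u}_j^\top \bar{\mathbf{v}}_o)\mathbf{u}_j}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \bar{\mathbf{v}}_o)}\right) = \frac{1}{2m}\left(\mathbf{u}_c - \sum_{j \in \mathcal{V}} P(w_j \mid \mathcal{W}_o)\mathbf{u}_j\right). \tag{14.1.15}$$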


We then use the same method to obtain the gradients for other word vectors. Unlike the skip-gram
model, we usually use the context word vector as the representation vector for a word in the CBOW
model.

Summary


• A word vector is a vector used to represent a word. The technique of mapping words to
vectors of real numbers is also known as word embedding.


• Word2vec includes both the continuous bag of words (CBOW) and skip-gram models. The
skip-gram model assumes that context words are generated based on the central target word.
The CBOW model assumes that the central target word is generated based on the context
words.


Exercises 


1. What is the computational complexity of each gradient? If the dictionary contains a large 
volume of words, what problems will this cause? 


2. There are some fixed phrases in the English language which consist of multiple words, such 
as “new york”. How can you train their word vectors? Hint: See section 4 in the Word2vec 
paper (Mikolov et al., 2013b). 


3. Use the skip-gram model as an example to think about the design of a word2vec model. What
is the relationship between the inner product of two word vectors and the cosine similarity
in the skip-gram model? For a pair of words with close semantic meaning, why is it likely
for their word vector cosine similarity to be high?


Discussions [195]


14.2 Approximate Training 


Recall the content of the last section. The core feature of the skip-gram model is the use of softmax
operations to compute the conditional probability of generating context word $w_o$ based on the
given central target word $w_c$:

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}. \tag{14.2.1}$$

The logarithmic loss corresponding to the conditional probability is given as

$$-\log P(w_o \mid w_c) = -\mathbf{u}_o^\top \mathbf{v}_c + \log\left(\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)\right). \tag{14.2.2}$$

Because the softmax operation considers that the context word could be any word in the
dictionary $\mathcal{V}$, the loss above includes a sum with as many terms as there are words in the
dictionary. From the last section, we know that for both the skip-gram model and the CBOW
model, because they both get the conditional probability using a softmax operation, the gradient
computation for each step also contains a sum with as many terms as the dictionary size. For
larger dictionaries with hundreds of thousands or even millions of words, the overhead for
computing each gradient may be too high. In order to reduce such computational complexity, we will
introduce two approximate training methods in this section: negative sampling and hierarchical
softmax. Since there is no major difference between the skip-gram model and the CBOW model,
we will only use the skip-gram model as an example to introduce these two training methods in
this section.





195 https://discuss.d2l.ai/t/381

14.2.1 Negative Sampling


Negative sampling modifies the original objective function. Given a context window for the central
target word $w_c$, we will treat it as an event for context word $w_o$ to appear in the context window
and compute the probability of this event from

$$P(D=1 \mid w_c, w_o) = \sigma(\mathbf{u}_o^\top \mathbf{v}_c). \tag{14.2.3}$$

Here, the $\sigma$ function has the same definition as the sigmoid activation function:

$$\sigma(x) = \frac{1}{1 + \exp(-x)}. \tag{14.2.4}$$

We will first consider training the word vector by maximizing the joint probability of all events in
the text sequence. Given a text sequence of length $T$, we assume that the word at time step $t$ is $w^{(t)}$
and the context window size is $m$. Now we consider maximizing the joint probability

$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(D=1 \mid w^{(t)}, w^{(t+j)}). \tag{14.2.5}$$
However, the events included in the model only consider positive examples. In this case, only
when all the word vectors are equal and their values approach infinity can the joint probability
above be maximized to 1. Obviously, such word vectors are meaningless. Negative sampling
makes the objective function more meaningful by additionally sampling negative examples.
Assume that event $P$ occurs when context word $w_o$ appears in the context window of central
target word $w_c$, and we sample $K$ words that do not appear in the context window according to
the distribution $P(w)$ to act as noise words. We assume the event for noise word $w_k$ ($k = 1, \ldots, K$)
to not appear in the context window of central target word $w_c$ is $N_k$. Suppose that events $P$ and
$N_1, \ldots, N_K$ for both positive and negative examples are independent of each other. By considering
negative sampling, we can rewrite the joint probability above, which only considers the positive
examples, as

$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)}). \tag{14.2.6}$$

Here, the conditional probability is approximated to be

$$P(w^{(t+j)} \mid w^{(t)}) = P(D=1 \mid w^{(t)}, w^{(t+j)}) \prod_{k=1,\ w_k \sim P(w)}^{K} P(D=0 \mid w^{(t)}, w_k). \tag{14.2.7}$$

Let the dictionary index of word $w^{(t)}$ at time step $t$ be $i_t$, and that of noise word $w_k$ be $h_k$.
The logarithmic loss for the conditional probability above is

$$\begin{aligned}
-\log P(w^{(t+j)} \mid w^{(t)})
&= -\log P(D=1 \mid w^{(t)}, w^{(t+j)}) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log P(D=0 \mid w^{(t)}, w_k) \\
&= -\log \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log\left(1 - \sigma\left(\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right)\right) \\
&= -\log \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log \sigma\left(-\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right).
\end{aligned} \tag{14.2.8}$$
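
The last line of (14.2.8) translates almost directly into code. The following is a minimal plain-Python sketch for a single center-context pair with its sampled noise words. The function name negative_sampling_loss and all vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(v_center, u_context, u_noise):
    """Loss for one positive pair plus K noise words, following (14.2.8)."""
    # Positive example: the context word should appear in the window
    loss = -math.log(sigmoid(dot(u_context, v_center)))
    # Negative examples: each noise word should not appear in the window
    for u_k in u_noise:
        loss -= math.log(sigmoid(-dot(u_k, v_center)))
    return loss


# With a zero center vector every sigmoid equals 0.5, so the loss is (1 + K) * log 2
baseline = negative_sampling_loss([0.0, 0.0], [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
# A center vector aligned with the context word and opposed to the noise words
aligned = negative_sampling_loss([1.0, 1.0], [2.0, 2.0], [[-2.0, -2.0]] * 2)
print(baseline, aligned)
```

The loop runs over the sampled noise words only, so the work per pair grows with the number of noise words rather than with the dictionary size.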
Here, the gradient computation in each step of the training is no longer related to the dictionary
size, but linearly related to $K$. When $K$ takes a smaller constant, negative sampling has a lower
computational overhead for each step.

14.2.2 Hierarchical Softmax


Hierarchical softmax is another type of approximate training method. It uses a binary tree as its
data structure, as illustrated in Fig. 14.2.1, with the leaf nodes of the tree representing every word
in the dictionary $\mathcal{V}$.


Fig. 14.2.1: Hierarchical Softmax. Each leaf node of the tree represents a word in the dictionary.


We assume that $L(w)$ is the number of nodes on the path (including the root and leaf nodes) from
the root node of the binary tree to the leaf node of word $w$. Let $n(w, j)$ be the $j$-th node on this path,
with the context word vector $\mathbf{u}_{n(w,j)}$. We use Fig. 14.2.1 as an example, so $L(w_3) = 4$. Hierarchical
softmax will approximate the conditional probability in the skip-gram model as

$$P(w_o \mid w_c) = \prod_{j=1}^{L(w_o)-1} \sigma\left([\![ n(w_o, j+1) = \textrm{leftChild}(n(w_o, j)) ]\!] \cdot \mathbf{u}_{n(w_o,j)}^\top \mathbf{v}_c\right). \tag{14.2.9}$$

Here the $\sigma$ function has the same definition as the sigmoid activation function, and $\textrm{leftChild}(n)$
is the left child node of node $n$. If $x$ is true, $[\![x]\!] = 1$; otherwise $[\![x]\!] = -1$. Now, we will compute
the conditional probability of generating word $w_3$ based on the given word $w_c$ in Fig. 14.2.1. We
need to find the inner product of word vector $\mathbf{v}_c$ (for word $w_c$) and each non-leaf node vector on
the path from the root node to $w_3$. Because, in the binary tree, the path from the root node to leaf
node $w_3$ needs to be traversed left, right, and left again (the path with the bold line in Fig. 14.2.1),
we get

$$P(w_3 \mid w_c) = \sigma(\mathbf{u}_{n(w_3,1)}^\top \mathbf{v}_c) \cdot \sigma(-\mathbf{u}_{n(w_3,2)}^\top \mathbf{v}_c) \cdot \sigma(\mathbf{u}_{n(w_3,3)}^\top \mathbf{v}_c). \tag{14.2.10}$$

Because $\sigma(x) + \sigma(-x) = 1$, the conditional probabilities of generating all the words in
dictionary $\mathcal{V}$ based on the given central target word $w_c$ sum to 1:

$$\sum_{w \in \mathcal{V}} P(w \mid w_c) = 1. \tag{14.2.11}$$
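
The normalization in (14.2.11) can be checked numerically on a small tree. The sketch below uses a hypothetical balanced tree with four leaf words and three internal nodes, which is not the tree of Fig. 14.2.1. Each path lists the internal nodes visited together with +1 for a left turn and -1 for a right turn, matching the bracket notation in (14.2.9).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical vectors for the three internal (non-leaf) nodes
node_vectors = {'root': [0.2, -0.4], 'left': [0.7, 0.1], 'right': [-0.3, 0.5]}

# Path from the root to each leaf: (internal node, +1 if next step is left else -1)
paths = {
    'w0': [('root', +1), ('left', +1)],
    'w1': [('root', +1), ('left', -1)],
    'w2': [('root', -1), ('right', +1)],
    'w3': [('root', -1), ('right', -1)],
}


def hs_prob(word, v_c):
    """P(word | w_c) as a product of sigmoids along the path, as in (14.2.9)."""
    prob = 1.0
    for node, sign in paths[word]:
        prob *= sigmoid(sign * dot(node_vectors[node], v_c))
    return prob


v_c = [0.9, -0.6]  # hypothetical center word vector
probs = {w: hs_prob(w, v_c) for w in paths}
total = sum(probs.values())
print(probs)
print(total)
```

Each probability needs only as many sigmoid evaluations as there are internal nodes on the path, which is 2 here for 4 words.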


In addition, because the order of magnitude for $L(w_o) - 1$ is $O(\log_2 |\mathcal{V}|)$, when the size of dictionary
$\mathcal{V}$ is large, the computational overhead for each step in the hierarchical softmax training is greatly
reduced compared to situations where we do not use approximate training.

Summary


• Negative sampling constructs the loss function by considering independent events that
contain both positive and negative examples. The gradient computational overhead for each
step in the training process is linearly related to the number of noise words we sample.


• Hierarchical softmax uses a binary tree and constructs the loss function based on the path
from the root node to the leaf node. The gradient computational overhead for each step in
the training process is related to the logarithm of the dictionary size.


Exercises 


1. Before reading the next section, think about how we should sample noise words in negative 
sampling. 


2. What makes the last formula in this section hold? 
3. How can we apply negative sampling and hierarchical softmax in the skip-gram model? 


Discussions [196]


14.3 The Dataset for Pretraining Word Embedding 


In this section, we will introduce how to preprocess a dataset with negative sampling (Section 14.2)
and load it into minibatches for word2vec training. The dataset we use is Penn Tree Bank (PTB) [197],
which is a small but commonly-used corpus. It takes samples from Wall Street Journal articles
and includes training sets, validation sets, and test sets.


First, import the packages and modules required for the experiment. 


from d2l import mxnet as d2l
import math
from mxnet import gluon, np
import os
import random


14.3.1 Reading and Preprocessing the Dataset 


This dataset has already been preprocessed. Each line of the dataset acts as a sentence. All the 
words in a sentence are separated by spaces. In the word embedding task, each word is a token. 


#@save
d2l.DATA_HUB['ptb'] = (d2l.DATA_URL + 'ptb.zip',
                       '319d85e578af0cdc590547f26231e4e31cdf1e42')

#@save
def read_ptb():
    data_dir = d2l.download_extract('ptb')
    with open(os.path.join(data_dir, 'ptb.train.txt')) as f:
        raw_text = f.read()
    return [line.split() for line in raw_text.split('\n')]

sentences = read_ptb()
f'# sentences: {len(sentences)}'


'# sentences: 42069'


196 https://discuss.d2l.ai/t/382
197 https://catalog.ldc.upenn.edu/LDC99T42


Next, we build a vocabulary in which words that appear fewer than 10 times are mapped into a “<unk>”
token. Note that the preprocessed PTB data also contains “<unk>” tokens representing rare words.


vocab = d2l.Vocab(sentences, min_freq=10)
f'vocab size: {len(vocab)}'


'vocab size: 6719'


14.3.2 Subsampling 

In text data, there are generally some words that appear at high frequencies, such as “the”, “a”, and
“in” in English. Generally speaking, in a context window, it is better to train the word embedding
model when a word (such as “chip”) and a lower-frequency word (such as “microprocessor”)
appear at the same time, rather than when a word appears with a higher-frequency word (such as
“the”). Therefore, when training the word embedding model, we can perform subsampling on
the words (Mikolov et al., 2013b). Specifically, each indexed word $w_i$ in the dataset will drop out
at a certain probability. The dropout probability is given as:

$$P(w_i) = \max\left(1 - \sqrt{\frac{t}{f(w_i)}}, 0\right). \tag{14.3.1}$$

Here, $f(w_i)$ is the ratio of the instances of word $w_i$ to the total number of words in the dataset,
and the constant $t$ is a hyperparameter (set to $10^{-4}$ in this experiment). As we can see, it is only
possible to drop out the word $w_i$ in subsampling when $f(w_i) > t$. The higher the word's frequency,
the higher its dropout probability.
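
To get a feel for (14.3.1), the sketch below evaluates the dropout probability for a few relative frequencies. The function name discard_prob and the example frequencies are illustrative assumptions rather than statistics of the PTB dataset.

```python
import math


def discard_prob(freq, t=1e-4):
    """Dropout probability from (14.3.1) for a word with relative frequency `freq`."""
    return max(1 - math.sqrt(t / freq), 0)


# Hypothetical relative frequencies: very common, moderately common, at and below t
for freq in (0.05, 0.001, 1e-4, 5e-5):
    print(freq, discard_prob(freq))
```

Words whose relative frequency does not exceed $t$ get probability 0 and are therefore always kept, while the probability approaches 1 as the frequency grows.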


#@save
def subsampling(sentences, vocab):
    # Map low frequency words into <unk>
    sentences = [[vocab.idx_to_token[vocab[tk]] for tk in line]
                 for line in sentences]
    # Count the frequency for each word
    counter = d2l.count_corpus(sentences)
    num_tokens = sum(counter.values())

    # Return True if to keep this token during subsampling
    def keep(token):
        return(random.uniform(0, 1) <
               math.sqrt(1e-4 / counter[token] * num_tokens))

    # Now do the subsampling
    return [[tk for tk in line if keep(tk)] for line in sentences]

subsampled = subsampling(sentences, vocab)


Comparing the sequence lengths before and after sampling, we can see that subsampling significantly
reduces the sequence length.


d2l.set_figsize()
d2l.plt.hist([[len(line) for line in sentences],
              [len(line) for line in subsampled]])
d2l.plt.xlabel('# tokens per sentence')
d2l.plt.ylabel('count')
d2l.plt.legend(['origin', 'subsampled']);





[Histogram: count versus # tokens per sentence, for the origin and subsampled datasets.]


For individual tokens, the sampling rate of the high-frequency word “the” is less than 1/20.

def compare_counts(token):
    return (f'# of "{token}": '
            f'before={sum([line.count(token) for line in sentences])}, '
            f'after={sum([line.count(token) for line in subsampled])}')

compare_counts('the')


'# of "the": before=50770, after=2117' 


But the low-frequency word “join” is completely preserved. 


compare_counts('join')


'# of "join": before=45, after=45' 


Last, we map each token into an index to construct the corpus.

corpus = [vocab[line] for line in subsampled]
corpus[0:3]


[[0, 0], [71, 2115, 18, 274], [5277, 3054, 1580]]


14.3.3 Loading the Dataset 


Next, we read the corpus with token indices into data batches for training.


Extracting Central Target Words and Context Words 


We use words with a distance from the central target word not exceeding the context window size
as the context words of the given central target word. The following function extracts all
the central target words and their context words. It uniformly and randomly samples an integer to
be used as the context window size between integer 1 and the max_window_size (maximum context
window).


#@save
def get_centers_and_contexts(corpus, max_window_size):
    centers, contexts = [], []
    for line in corpus:
        # Each sentence needs at least 2 words to form a "central target word
        # - context word" pair
        if len(line) < 2:
            continue
        centers += line
        for i in range(len(line)):  # Context window centered at i
            window_size = random.randint(1, max_window_size)
            indices = list(range(max(0, i - window_size),
                                 min(len(line), i + 1 + window_size)))
            # Exclude the central target word from the context words
            indices.remove(i)
            contexts.append([line[idx] for idx in indices])
    return centers, contexts


Next, we create an artificial dataset containing two sentences of 7 and 3 words, respectively. As- 
sume the maximum context window is 2 and print all the central target words and their context 
words. 


tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)


dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]
center 0 has contexts [1, 2]
center 1 has contexts [0, 2]
center 2 has contexts [0, 1, 3, 4]
center 3 has contexts [2, 4]
center 4 has contexts [3, 5]
center 5 has contexts [3, 4, 6]
center 6 has contexts [5]
center 7 has contexts [8, 9]
center 8 has contexts [7, 9]
center 9 has contexts [8]


We set the maximum context window size to 5. The following extracts all the central target words 
and their context words in the dataset. 


all_centers, all_contexts = get_centers_and_contexts(corpus, 5)
f'# center-context pairs: {len(all_centers)}'


'# center-context pairs: 353167'


Negative Sampling 


We use negative sampling for approximate training. For a central and context word pair, we ran- 
domly sample K noise words (K = 5 in the experiment). According to the suggestion in the 
Word2vec paper, the noise word sampling probability P(w) is the ratio of the word frequency of 
w to the total word frequency raised to the power of 0.75 (Mikolov et al., 2013b). 
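
The effect of the 0.75 exponent can be seen on a toy example. The sketch below, with hypothetical word counts and a hypothetical helper named noise_distribution, compares the raw relative frequencies with the smoothed noise distribution.

```python
def noise_distribution(counts, power=0.75):
    """Normalize counts raised to `power` into a sampling distribution."""
    weights = [c ** power for c in counts]
    total = sum(weights)
    return [w / total for w in weights]


counts = [100, 10, 1]  # hypothetical counts of a frequent, a medium and a rare word
raw = [c / sum(counts) for c in counts]
smoothed = noise_distribution(counts)
print(raw)
print(smoothed)
```

Raising counts to a power below 1 shifts some probability mass from frequent words to rare ones, so rare words are drawn as noise more often than their raw frequency would suggest.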


We first define a class to draw a candidate according to the sampling weights. It caches a bank of
10000 random numbers instead of calling random.choices every time.


#@save
class RandomGenerator:
    """Draw a random int in [0, n) according to n sampling weights."""
    def __init__(self, sampling_weights):
        self.population = list(range(len(sampling_weights)))
        self.sampling_weights = sampling_weights
        self.candidates = []
        self.i = 0

    def draw(self):
        if self.i == len(self.candidates):
            self.candidates = random.choices(
                self.population, self.sampling_weights, k=10000)
            self.i = 0
        self.i += 1
        return self.candidates[self.i-1]

generator = RandomGenerator([2, 3, 4])
[generator.draw() for _ in range(10)]


[Output: a list of 10 integers drawn from {0, 1, 2}.]


#@save
def get_negatives(all_contexts, corpus, K):
    counter = d2l.count_corpus(corpus)
    sampling_weights = [counter[i]**0.75 for i in range(len(counter))]
    all_negatives, generator = [], RandomGenerator(sampling_weights)
    for contexts in all_contexts:
        negatives = []
        while len(negatives) < len(contexts) * K:
            neg = generator.draw()
            # Noise words cannot be context words
            if neg not in contexts:
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives

all_negatives = get_negatives(all_contexts, corpus, 5)


Reading into Batches 


We extract all central target words all_centers, and the context words all_contexts and noise
words all_negatives of each central target word from the dataset. We will read them in random
minibatches.


In a minibatch of data, the $i$-th example includes a central word and its corresponding $n_i$
context words and $m_i$ noise words. Since the context window size of each example may be
different, the sum of context words and noise words, $n_i + m_i$, will be different. When constructing a
minibatch, we concatenate the context words and noise words of each example, and add 0s for
padding until the lengths of the concatenations are the same, that is, the length of all
concatenations is $\max_i n_i + m_i$ (max_len). In order to avoid the effect of padding on the loss function
calculation, we construct the mask variable masks, each element of which corresponds to an
element in the concatenation of context and noise words, contexts_negatives. When an element
in the variable contexts_negatives is a padding, the element in the mask variable masks at the
same position will be 0. Otherwise, it takes the value 1. In order to distinguish between positive
and negative examples, we also need to distinguish the context words from the noise words in the
contexts_negatives variable. Based on the construction of the mask variable, we only need to
create a label variable labels with the same shape as the contexts_negatives variable and set the
elements corresponding to context words (positive examples) to 1, and the rest to 0.


Next, we will implement the minibatch reading function batchify. Its minibatch input data is a
list whose length is the batch size, each element of which contains central target words center,
context words context, and noise words negative. The minibatch data returned by this function
conforms to the format we need. For example, it includes the mask variable.


#@save
def batchify(data):
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers += [center]
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)]
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]
        labels += [[1] * len(context) + [0] * (max_len - len(context))]
    return (np.array(centers).reshape((-1, 1)), np.array(contexts_negatives),
            np.array(masks), np.array(labels))


Construct two simple examples: 


il = Cl, 12, 21) Sy 3, de 310) 
A Cl, 2 2 Zily a, SD 
batch = batchify((x_1, x_2)) 


names = ['centers', 'contexts_negatives', ‘masks’, 'labels'] 
for name, data in zip(names, batch): 
print(name, '=', data) 


centers = [[1.]
 [1.]]
contexts_negatives = [[2. 2. 3. 3. 3. 3.]
 [2. 2. 2. 3. 3. 0.]]
masks = [[1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 0.]]
labels = [[1. 1. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0.]]


We use the batchify function just defined to specify the minibatch reading method in the Dat- 
aLoader instance. 


14.3.4 Putting All Things Together 


Last, we define the load_data_ptb function that reads the PTB dataset and returns the data iterator
together with the vocabulary.


#@save
def load_data_ptb(batch_size, max_window_size, num_noise_words):
    num_workers = d2l.get_dataloader_workers()
    sentences = read_ptb()
    vocab = d2l.Vocab(sentences, min_freq=10)
    subsampled = subsampling(sentences, vocab)
    corpus = [vocab[line] for line in subsampled]
    all_centers, all_contexts = get_centers_and_contexts(
        corpus, max_window_size)
    all_negatives = get_negatives(all_contexts, corpus, num_noise_words)
    dataset = gluon.data.ArrayDataset(
        all_centers, all_contexts, all_negatives)
    data_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True,
                                      batchify_fn=batchify,
                                      num_workers=num_workers)
    return data_iter, vocab


Let us print the first minibatch of the data iterator. 


data_iter, vocab = load_data_ptb(512, 5, 5)
for batch in data_iter:
    for name, data in zip(names, batch):
        print(name, 'shape:', data.shape)
    break


centers shape: (512, 1) 
contexts_negatives shape: (512, 60) 
masks shape: (512, 60) 
labels shape: (512, 60) 


Summary 


• Subsampling attempts to minimize the impact of high-frequency words on the training of a
  word embedding model.


• We can pad examples of different lengths to create minibatches in which all examples have
  the same length, and use mask variables to distinguish between padding and non-padding
  elements, so that only non-padding elements participate in the calculation of the loss function.


Exercises 


1. We use the batchify function to specify the minibatch reading method in the DataLoader 
instance and print the shape of each variable in the first batch read. How should these shapes 
be calculated? 


Discussions


14.4 Pretraining word2vec 


In this section, we will train a skip-gram model defined in Section 14.1. 


First, import the packages and modules required for the experiment, and load the PTB dataset. 


from d2l import mxnet as d2l
from mxnet import autograd, gluon, np, npx
from mxnet.gluon import nn

npx.set_np()

batch_size, max_window_size, num_noise_words = 512, 5, 5
data_iter, vocab = d2l.load_data_ptb(batch_size, max_window_size,
                                     num_noise_words)


Downloading ../data/ptb.zip from http://d2l-data.s3-accelerate.amazonaws.com/ptb.zip...





18 https://discuss.d2l.ai/t/383
14.4.1 The Skip-Gram Model


We will implement the skip-gram model by using embedding layers and minibatch multiplication. 
These methods are also often used to implement other natural language processing applications. 


Embedding Layer 


As described in Section 9.7, the layer that maps a word index to its word vector is called the
embedding layer. It can be obtained by creating an nn.Embedding instance in high-level APIs. The
weight of the embedding layer is a matrix whose number of rows is the dictionary size (input_dim)
and whose number of columns is the dimension of each word vector (output_dim). We set the
dictionary size to 20 and the word vector dimension to 4.


embed = nn.Embedding(input_dim=20, output_dim=4)
embed.initialize()
embed.weight


Parameter embedding0_weight (shape=(20, 4), dtype=float32) 


The input of the embedding layer is the index of a word. When we enter the index i of a word,
the embedding layer returns the i-th row of the weight matrix as its word vector. Below we enter
indices of shape (2, 3) into the embedding layer. Because the dimension of the word vector is 4,
we obtain word vectors of shape (2, 3, 4).


x = np.array([[1, 2, 3], [4, 5, 6]])
embed(x)


array([[[ 0.01438687,  0.05011239,  0.00628365,  0.04861524],
        [-0.01068833,  0.01729892,  0.02042518, -0.01618656],
        [-0.00873779, -0.02834515,  0.05484822, -0.06206018]],

       [[ 0.06491279, -0.03182812, -0.01631819, -0.00312688],
        [ 0.0408415 ,  0.04370362,  0.00404529, -0.0028032 ],
        [ 0.00952624, -0.01501013,  0.05958354,  0.04705103]]])
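The lookup performed by an embedding layer can be pictured as plain row indexing into its weight matrix. The following is a minimal sketch in ordinary NumPy rather than MXNet; the matrix W and its random values are made up for illustration and only stand in for the layer's weight.

```python
import numpy as np

# Hypothetical 20 x 4 weight matrix standing in for the embedding weight
rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(20, 4))

x = np.array([[1, 2, 3], [4, 5, 6]])  # word indices of shape (2, 3)
vectors = W[x]                        # integer-array indexing selects rows of W

print(vectors.shape)                        # (2, 3, 4)
print(np.array_equal(vectors[1, 2], W[6]))  # True: index 6 maps to row 6
```

Each index is replaced by one row of W, which is why the output gains a trailing axis whose length is the word vector dimension.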


Skip-gram Model Forward Calculation 


In forward calculation, the input of the skip-gram model contains the central target word index
center and the concatenated context and noise word index contexts_and_negatives. Here,
the center variable has the shape (batch size, 1), while the contexts_and_negatives variable has
the shape (batch size, max_len). These two variables are first transformed from word indices to
word vectors by the word embedding layer, and then the output of shape (batch size, 1, max_len)
is obtained by minibatch multiplication. Each element in the output is the inner product of the
central target word vector and the context word vector or noise word vector.


def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center)
    u = embed_u(contexts_and_negatives)
    pred = npx.batch_dot(v, u.swapaxes(1, 2))
    return pred


Verify that the output shape is (batch size, 1, max_len).


skip_gram(np.ones((2, 1)), np.ones((2, 4)), embed, embed).shape


(2, 1, 4) 
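To see why the minibatch multiplication yields inner products, the same shape arithmetic can be reproduced with ordinary NumPy. This is only a sketch: the vectors are random placeholders, and np.matmul is used here in place of npx.batch_dot.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_len, embed_dim = 2, 4, 3
v = rng.normal(size=(batch_size, 1, embed_dim))        # center word vectors
u = rng.normal(size=(batch_size, max_len, embed_dim))  # context and noise word vectors

# Batched product: (batch, 1, d) @ (batch, d, max_len) -> (batch, 1, max_len)
pred = np.matmul(v, u.swapaxes(1, 2))
print(pred.shape)  # (2, 1, 4)

# Entry [b, 0, j] is the inner product of v[b, 0] and u[b, j]
print(np.isclose(pred[1, 0, 2], np.dot(v[1, 0], u[1, 2])))  # True
```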


14.4.2 Training 


Before training the word embedding model, we need to define the loss function of the model. 


Binary Cross Entropy Loss Function 


According to the definition of the loss function in negative sampling, we can directly use the binary 
cross-entropy loss function from high-level APIs. 


loss = gluon.loss.SigmoidBCELoss() 


It is worth mentioning that we can use the mask variable to specify which predicted values and
labels participate in the loss calculation in the minibatch: when the mask is 1, the predicted
value and label at the corresponding position participate in the calculation of the loss
function; when the mask is 0, they do not. As we mentioned earlier, mask variables
can be used to avoid the effect of padding on loss function calculations.


Given two identical examples, different masks lead to different loss values.


pred = np.array([[.5]*4]*2)
label = np.array([[1., 0., 1., 0.]]*2)
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])
loss(pred, label, mask)


array([0.724077 , 0.3620385]) 


Because examples contain different numbers of non-padding elements, we can normalize the loss of
each example by the number of elements that its mask keeps.


loss(pred, label, mask) / mask.sum(axis=1) * mask.shape[1] 


array([0.724077, 0.724077])
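The masked loss values above can be reproduced by hand. The sketch below is plain NumPy, not the Gluon implementation: the helper masked_sigmoid_bce is a hypothetical name, and it assumes that the mask acts as an elementwise weight applied before averaging over each example, which is consistent with the numbers printed above.

```python
import numpy as np

def masked_sigmoid_bce(pred, label, mask):
    """Sigmoid binary cross-entropy on logits; masked elements are zeroed
    before averaging over the last axis."""
    # Numerically stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    per_element = (np.maximum(pred, 0) - pred * label
                   + np.log1p(np.exp(-np.abs(pred))))
    return (per_element * mask).mean(axis=1)

pred = np.array([[0.5] * 4] * 2)
label = np.array([[1., 0., 1., 0.]] * 2)
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])

raw = masked_sigmoid_bce(pred, label, mask)
normalized = raw / mask.sum(axis=1) * mask.shape[1]
print(raw)         # second example is halved because half of it is masked out
print(normalized)  # rescaling by valid length makes both examples agree
```

The rescaling matters because the average is taken over max_len positions, so without it an example with more padding would contribute a smaller loss.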
Initializing Model Parameters


We construct the embedding layers of the central and context words, respectively, and set the 
hyperparameter word vector dimension embed_size to 100. 


embed_size = 100
net = nn.Sequential()
net.add(nn.Embedding(input_dim=len(vocab), output_dim=embed_size),
        nn.Embedding(input_dim=len(vocab), output_dim=embed_size))


Training 


The training function is defined below. Because of the existence of padding, the calculation of the 
loss function is slightly different compared to the previous training functions. 


def train(net, data_iter, lr, num_epochs, device=d2l.try_gpu()):
    net.initialize(ctx=device, force_reinit=True)
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': lr})
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs])
    metric = d2l.Accumulator(2)  # Sum of losses, no. of tokens
    for epoch in range(num_epochs):
        timer, num_batches = d2l.Timer(), len(data_iter)
        for i, batch in enumerate(data_iter):
            center, context_negative, mask, label = [
                data.as_in_ctx(device) for data in batch]
            with autograd.record():
                pred = skip_gram(center, context_negative, net[0], net[1])
                l = (loss(pred.reshape(label.shape), label, mask)
                     / mask.sum(axis=1) * mask.shape[1])
            l.backward()
            trainer.step(batch_size)
            metric.add(l.sum(), l.size)
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, '
          f'{metric[1] / timer.stop():.1f} tokens/sec on {str(device)}')


Now, we can train a skip-gram model using negative sampling. 


lr, num_epochs = 0.01, 5 
train(net, data_iter, lr, num_epochs) 
loss 0.373, 107559.5 tokens/sec on gpu(0)


[Figure: training loss (y-axis, ticks from 0.40 to 0.55) plotted against epoch (x-axis)]


14.4.3 Applying the Word Embedding Model 


After training the word embedding model, we can represent similarity in meaning between words 
based on the cosine similarity of two word vectors. As we can see, when using the trained word 
embedding model, the words closest in meaning to the word “chip” are mostly related to chips. 


def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data()
    x = W[vocab[query_token]]
    # Compute the cosine similarity. Add 1e-9 for numerical stability
    cos = np.dot(W, x) / np.sqrt(np.sum(W * W, axis=1) * np.sum(x * x) + 1e-9)
    topk = npx.topk(cos, k=k+1, ret_typ='indices').asnumpy().astype('int32')
    for i in topk[1:]:  # Remove the input words
        print(f'cosine sim={float(cos[i]):.3f}: {vocab.idx_to_token[i]}')

get_similar_tokens('chip', 3, net[0])


cosine sim=0.594: microprocessor 
cosine sim=0.494: intel 
cosine sim=0.478: desktop 
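The same nearest-neighbor search can be sketched without MXNet. In the NumPy version below, the function name top_k_cosine and the toy matrix W are assumptions made for illustration. Unlike the function above, which drops the first hit on the assumption that it is the query itself, this sketch removes the query by index.

```python
import numpy as np

def top_k_cosine(W, query_idx, k):
    """Return (index, cosine) pairs for the k rows of W most similar to row query_idx."""
    x = W[query_idx]
    # Same formula as above, including the 1e-9 term for numerical stability
    cos = W @ x / np.sqrt(np.sum(W * W, axis=1) * np.sum(x * x) + 1e-9)
    order = np.argsort(-cos)               # indices by descending similarity
    order = order[order != query_idx][:k]  # drop the query word itself
    return [(int(i), float(cos[i])) for i in order]

# Toy embedding matrix (made-up values): rows 0 and 1 point in similar directions
W = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbors = top_k_cosine(W, 0, 2)
print(neighbors)
```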
Summary


• We can pretrain a skip-gram model through negative sampling.


Exercises 
1. Set sparse_grad=True when creating an instance of nn.Embedding. Does it accelerate train- 
ing? Look up MXNet documentation to learn the meaning of this argument. 
2. Try to find synonyms for other words. 
3. Tune the hyperparameters and observe and analyze the experimental results. 


4. When the dataset is large, we usually sample the context words and the noise words for the 
central target word in the current minibatch only when updating the model parameters. In 
other words, the same central target word may have different context words or noise words 
in different epochs. What are the benefits of this sort of training? Try to implement this 
training method. 


Discussions


14.5 Word Embedding with Global Vectors (GloVe) 


First, we should review the skip-gram model in word2vec. The conditional probability
$P(w_j \mid w_i)$ expressed in the skip-gram model using the softmax operation will be recorded as
$q_{ij}$, that is:


$$q_{ij} = \frac{\exp(\mathbf{u}_j^\top \mathbf{v}_i)}{\sum_{k \in \mathcal{V}} \exp(\mathbf{u}_k^\top \mathbf{v}_i)},$$    (14.5.1)


where $\mathbf{v}_i$ and $\mathbf{u}_i$ are the vector representations of word $w_i$ of index $i$
as the center word and context word respectively, and
$\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}| - 1\}$ is the vocabulary index set.


A word $w_i$ may appear in the dataset multiple times. We collect all the context words
every time $w_i$ is a center word and keep duplicates, denoting the result as the multiset
$\mathcal{C}_i$. The number of times an element occurs in a multiset is called the multiplicity
of the element. For instance, suppose that word $w_i$ appears twice in the dataset: the context
windows when these two $w_i$ become center words in the text sequence contain context word indices
2, 1, 5, 2 and 2, 3, 2, 1. Then, multiset $\mathcal{C}_i = \{1, 1, 2, 2, 2, 2, 3, 5\}$, where the
multiplicity of element 1 is 2, the multiplicity of element 2 is 4, and the multiplicities of
elements 3 and 5 are both 1. Denote the multiplicity of element $j$ in multiset $\mathcal{C}_i$ as
$x_{ij}$: it is the number of occurrences of word $w_j$ in all the context windows for center word
$w_i$ in the entire dataset. As a result, the loss function of the skip-gram model can be
expressed in a different way:


$$-\sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{V}} x_{ij} \log q_{ij}.$$    (14.5.2)
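The multiplicities in the example above can be counted mechanically. The following sketch uses Python's collections.Counter as a multiset; the variable names are chosen here for illustration only.

```python
from collections import Counter

# Context word indices seen in the two windows around center word w_i
windows = [[2, 1, 5, 2], [2, 3, 2, 1]]

# Multiset C_i as a Counter: keys are context indices j, values are x_ij
multiplicity = Counter(j for window in windows for j in window)

print(sorted(multiplicity.items()))  # [(1, 2), (2, 4), (3, 1), (5, 1)]
print(sum(multiplicity.values()))    # 8, the total number of context words
```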


We add up the number of all the context words for the central target word $w_i$ to get $x_i$, and
record the conditional probability $x_{ij}/x_i$ for generating context word $w_j$ based on central
target word $w_i$ as $p_{ij}$. We can rewrite the loss function of the skip-gram model as


$$-\sum_{i \in \mathcal{V}} x_i \sum_{j \in \mathcal{V}} p_{ij} \log q_{ij}.$$    (14.5.3)





1 https://discuss.d2l.ai/t/384
In the formula above, $-\sum_{j \in \mathcal{V}} p_{ij} \log q_{ij}$ is the cross-entropy between
the conditional probability distribution $p_{ij}$ of context word generation given the central
target word $w_i$ and the conditional probability distribution $q_{ij}$ predicted by the model.
The loss function is weighted by $x_i$, the total number of context words for the central target
word $w_i$. Minimizing the loss function in the formula above drives the predicted conditional
probability distribution as close as possible to the true conditional probability distribution.
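That the two forms of the loss, (14.5.2) and (14.5.3), agree can be checked numerically. The sketch below uses made-up co-occurrence counts and random model scores; nothing in it comes from a real corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.integers(1, 10, size=(n, n)).astype(float)  # made-up counts x_ij

# Model probabilities q_ij from a softmax over random scores (row i = center word)
scores = rng.normal(size=(n, n))
q = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

x_i = x.sum(axis=1, keepdims=True)  # total context count per center word
p = x / x_i                         # empirical conditional probabilities p_ij

loss_counts = -(x * np.log(q)).sum()                              # form (14.5.2)
loss_weighted = -(x_i[:, 0] * (p * np.log(q)).sum(axis=1)).sum()  # form (14.5.3)
print(np.isclose(loss_counts, loss_weighted))  # True
```

The agreement follows directly from $x_{ij} = x_i p_{ij}$, so the second form only regroups the terms of the first.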


However, although it is the most common type of loss function, the cross-entropy loss function is
sometimes not a good choice. On the one hand, as we mentioned in Section 14.2, making the model
prediction $q_{ij}$ a valid probability distribution requires a denominator that sums over all
items in the entire dictionary. This can easily lead to excessive computational overhead.
On the other hand, there are often a lot of uncommon words in the dictionary, and they appear
rarely in the dataset. In the cross-entropy loss function, the final prediction of the conditional
probability distribution on a large number of uncommon words is likely to be inaccurate.


14.5.1 The GloVe Model 


To address this, GloVe (Pennington et al., 2014), a word embedding model that came after 
word2vec, adopts squared loss and makes three changes to the skip-gram model based on this 
loss. 


1. Here, we use the non-probability distribution variables $p'_{ij} = x_{ij}$ and
   $q'_{ij} = \exp(\mathbf{u}_j^\top \mathbf{v}_i)$ and take their logs. Therefore, we get the
   squared loss $(\log p'_{ij} - \log q'_{ij})^2 = (\mathbf{u}_j^\top \mathbf{v}_i - \log x_{ij})^2$.


2. We add two scalar model parameters for each word $w_i$: the bias terms $b_i$ (for central target
   words) and $c_i$ (for context words).


3. Replace the weight of each loss with the function $h(x_{ij})$. The weight function $h(x)$ is a
   monotone increasing function with the range $[0, 1]$.


Therefore, the goal of GloVe is to minimize the loss function


$$\sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{V}} h(x_{ij}) \left(\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j - \log x_{ij}\right)^2.$$    (14.5.4)


Here, we have a suggestion for the choice of weight function $h(x)$: when $x < c$ (e.g., $c = 100$),
make $h(x) = (x/c)^\alpha$ (e.g., $\alpha = 0.75$), otherwise make $h(x) = 1$. Because $h(0) = 0$,
the squared loss term for $x_{ij} = 0$ can be simply ignored. When we use minibatch SGD for
training, we randomly sample a minibatch of non-zero $x_{ij}$ at each time step and compute the
gradient to update the model parameters. These non-zero $x_{ij}$ are computed in advance based on
the entire dataset and they contain global statistics for the dataset. Therefore, the name GloVe is
taken from “Global Vectors”.
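The weight function and the loss (14.5.4) are short enough to write out directly. The following is a minimal NumPy sketch, not a reference implementation: the function names glove_weight and glove_loss and the argument layout are assumptions made for illustration.

```python
import numpy as np

def glove_weight(x, c=100.0, alpha=0.75):
    """Weight h(x): (x / c) ** alpha below the cutoff c, and 1 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x < c, (x / c) ** alpha, 1.0)

def glove_loss(U, V, b, c_bias, X):
    """Weighted squared loss (14.5.4) over the non-zero entries of X.

    Rows of V are center word vectors, rows of U are context word vectors,
    b and c_bias are the center and context bias terms.
    """
    i, j = np.nonzero(X)  # h(0) = 0, so zero counts contribute nothing
    pred = np.sum(U[j] * V[i], axis=1) + b[i] + c_bias[j]
    return np.sum(glove_weight(X[i, j]) * (pred - np.log(X[i, j])) ** 2)

print(glove_weight([0, 50, 100, 1000]))
```

Restricting the sum to the non-zero entries is what keeps the cost proportional to the number of observed co-occurrences rather than to the square of the vocabulary size.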


Notice that if word $w_i$ appears in the context window of word $w_j$, then word $w_j$ will also
appear in the context window of word $w_i$. Therefore, $x_{ij} = x_{ji}$. Unlike word2vec, GloVe
fits the symmetric $\log x_{ij}$ in lieu of the asymmetric conditional probability $p_{ij}$.
Therefore, the central target word vector and context word vector of any word are equivalent in
GloVe. However, the two sets of word vectors learned for the same word may be different in the end
due to different initialization values. After learning all the word vectors, GloVe will use the sum
of the central target word vector and the context word vector as the final word vector for the word.
14.5.2 Understanding GloVe from Conditional Probability Ratios


We can also try to understand GloVe word embedding from another perspective. We will continue
to use the symbols from earlier in this section: $P(w_j \mid w_i)$ represents the conditional
probability of generating context word $w_j$ with central target word $w_i$ in the dataset, and it
will be recorded as $p_{ij}$. From a real example from a large corpus, here we have the following
two sets of conditional probabilities with “ice” and “steam” as the central target words and the
ratio between them:





w_k =                   solid      gas        water    fashion
p_1 = P(w_k | ice)      0.00019    0.000066   0.003    0.000017
p_2 = P(w_k | steam)    0.000022   0.00078    0.0022   0.000018
p_1 / p_2               8.9        0.085      1.36     0.96

We will be able to observe phenomena such as: 


• For a word $w_k$ that is related to “ice” but not to “steam”, such as $w_k$ = solid, we would
  expect a larger conditional probability ratio, like the value 8.9 in the last row of the table
  above.


• For a word $w_k$ that is related to “steam” but not to “ice”, such as $w_k$ = gas, we would
  expect a smaller conditional probability ratio, like the value 0.085 in the last row of the table
  above.


• For a word $w_k$ that is related to both “ice” and “steam”, such as $w_k$ = water, we would
  expect a conditional probability ratio close to 1, like the value 1.36 in the last row of the
  table above.


• For a word $w_k$ that is related to neither “ice” nor “steam”, such as $w_k$ = fashion, we would
  expect a conditional probability ratio close to 1, like the value 0.96 in the last row of the
  table above.


We can see that the conditional probability ratio can represent the relationship between different
words more intuitively. We can construct a word vector function to fit the conditional probability
ratio more effectively. As we know, obtaining any ratio of this type requires three words $w_i$,
$w_j$, and $w_k$. The conditional probability ratio with $w_i$ as the central target word is
$p_{ij}/p_{ik}$. We can find a function that uses word vectors to fit this conditional probability
ratio:


$$f(\mathbf{u}_j, \mathbf{u}_k, \mathbf{v}_i) \approx \frac{p_{ij}}{p_{ik}}.$$    (14.5.5)


The possible design of function $f$ here will not be unique. We only need to consider a more
reasonable possibility. Since the conditional probability ratio is a scalar, we can limit $f$ to be
a scalar function:
$f(\mathbf{u}_j, \mathbf{u}_k, \mathbf{v}_i) = f\left((\mathbf{u}_j - \mathbf{u}_k)^\top \mathbf{v}_i\right)$.
After exchanging index $j$ with $k$, we will be able to see that function $f$ satisfies the
condition $f(x) f(-x) = 1$, so one possibility could be $f(x) = \exp(x)$. Thus:


$$f(\mathbf{u}_j, \mathbf{u}_k, \mathbf{v}_i) = \frac{\exp(\mathbf{u}_j^\top \mathbf{v}_i)}{\exp(\mathbf{u}_k^\top \mathbf{v}_i)} \approx \frac{p_{ij}}{p_{ik}}.$$    (14.5.6)


One possibility that satisfies the right side of the approximation sign is
$\exp(\mathbf{u}_j^\top \mathbf{v}_i) \approx \alpha p_{ij}$, where $\alpha$ is a constant.
Considering that $p_{ij} = x_{ij}/x_i$, after taking the logarithm we get
$\mathbf{u}_j^\top \mathbf{v}_i \approx \log \alpha + \log x_{ij} - \log x_i$. We use additional
bias terms to fit $-\log \alpha + \log x_i$, such as the central target word bias term $b_i$ and
context word bias term $c_j$:


$$\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j \approx \log(x_{ij}).$$    (14.5.7)


By taking the square error and weighting the left and right sides of the formula above, we can get 
the loss function of GloVe. 
Summary


• In some cases, the cross-entropy loss function may have a disadvantage. GloVe uses squared
  loss and the word vector to fit global statistics computed in advance based on the entire
  dataset.


• The central target word vector and context word vector of any word are equivalent in GloVe.


Exercises 


1. If a word appears in the context window of another word, how can we use the distance between
   them in the text sequence to redesign the method for computing the conditional probability
   $p_{ij}$? Hint: See section 4.2 from the paper GloVe (Pennington et al., 2014).


2. For any word, will its central target word bias term and context word bias term be equivalent 
to each other in GloVe? Why? 


Discussions


14.6 Subword Embedding 


English words usually have internal structures and formation methods. For example, we can de- 
duce the relationship between “dog”, “dogs”, and “dogcatcher” by their spelling. All these words 
have the same root, “dog”, but they use different suffixes to change the meaning of the word. 
Moreover, this association can be extended to other words. For example, the relationship be- 
tween “dog” and “dogs” is just like the relationship between “cat” and “cats”. The relationship 
between “boy” and “boyfriend” is just like the relationship between “girl” and “girlfriend”. This 
characteristic is not unique to English. In French and Spanish, a lot of verbs can have more than 
40 different forms depending on the context. In Finnish, a noun may have more than 15 forms. In 
fact, morphology, which is an important branch of linguistics, studies the internal structure and 
formation of words. 


14.6.1 fastText 


In word2vec, we did not directly use morphology information. In both the skip-gram model and 
continuous bag-of-words model, we use different vectors to represent words with different forms. 
For example, “dog” and “dogs” are represented by two different vectors, while the relationship 
between these two vectors is not directly represented in the model. In view of this, fastText (Bo- 
janowski et al., 2017) proposes the method of subword embedding, thereby attempting to intro- 
duce morphological information in the skip-gram model in word2vec. 


In fastText, each central word is represented as a collection of subwords. Below we use the word 
“where” as an example to understand how subwords are formed. First, we add the special charac- 
ters “<” and “>” at the beginning and end of the word to distinguish the subwords used as prefixes 
and suffixes. Then, we treat the word as a sequence of characters to extract the n-grams. For 
example, when n = 3, we can get all subwords with a length of 3: 


"<wh", "whe", "her", "ere", "re>", (14.6.1) 





20 https://discuss.d2l.ai/t/385
and the special subword "<where>".
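The extraction of the subwords listed above can be sketched in a few lines. The helper name char_ngrams is hypothetical and is not part of fastText.

```python
def char_ngrams(word, n):
    """Return all character n-grams of `word` after wrapping it in '<' and '>'."""
    wrapped = f'<{word}>'
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams('where', 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because of the boundary characters, the prefix "<wh" and the suffix "re>" stay distinct from the same letters appearing in the middle of another word.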


In fastText, for a word $w$, we record the union of all its subwords with length of 3 to 6 and
special subwords as $\mathcal{G}_w$. Thus, the dictionary is the union of the collection of
subwords of all words. Assume the vector of the subword $g$ in the dictionary is $\mathbf{z}_g$.
Then, the central word vector $\mathbf{u}_w$ for the word $w$ in the skip-gram model can be
expressed as


$$\mathbf{u}_w = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g.$$    (14.6.2)


The rest of the fastText process is consistent with the skip-gram model, so it is not repeated here. 
As we can see, compared with the skip-gram model, the dictionary in fastText is larger, resulting 
in more model parameters. Also, the vector of one word requires the summation of all subword 
vectors, which results in higher computation complexity. However, we can obtain better vectors 
for more uncommon complex words, even words not existing in the dictionary, by looking at other 
words with similar structures. 
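The way an unseen word still receives a vector can be illustrated with a toy subword table. Everything below is a sketch under stated assumptions: the table holds random vectors for the subwords of two made-up "training" words, and the helper names subwords and word_vector are hypothetical. Real fastText also hashes subwords into a fixed number of buckets, which is not modeled here.

```python
import numpy as np

def subwords(word, min_n=3, max_n=6):
    """All character n-grams with min_n <= n <= max_n, plus the special whole-word subword."""
    wrapped = f'<{word}>'
    grams = {wrapped[i:i + n]
             for n in range(min_n, max_n + 1)
             for i in range(len(wrapped) - n + 1)}
    grams.add(wrapped)
    return grams

rng = np.random.default_rng(0)
embed_dim = 4
# Hypothetical subword table built from two "training" words
table = {g: rng.normal(size=embed_dim)
         for w in ['dog', 'dogs'] for g in subwords(w)}

def word_vector(word):
    """Sum the vectors of the subwords of `word` that are present in the table."""
    known = [table[g] for g in subwords(word) if g in table]
    return np.sum(known, axis=0) if known else np.zeros(embed_dim)

# 'dogcatcher' was never seen, yet it shares subwords such as '<do' and 'dog'
shared = subwords('dogcatcher') & set(table)
print(sorted(shared))
print(word_vector('dogcatcher'))
```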


14.6.2 Byte Pair Encoding 


In fastText, all the extracted subwords have to be of the specified lengths, such as 3 to 6, thus the 
vocabulary size cannot be predefined. To allow for variable-length subwords in a fixed-size vocab- 
ulary, we can apply a compression algorithm called byte pair encoding (BPE) to extract subwords 
(Sennrich et al., 2015). 


Byte pair encoding performs a statistical analysis of the training dataset to discover common sym- 
bols within a word, such as consecutive characters of arbitrary length. Starting from symbols of 
length 1, byte pair encoding iteratively merges the most frequent pair of consecutive symbols to 
produce new longer symbols. Note that for efficiency, pairs crossing word boundaries are not
considered. In the end, we can use such symbols as subwords to segment words. Byte pair encoding
and its variants have been used for input representations in popular natural language processing
pretraining models such as GPT-2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019). In the
following, we will illustrate how byte pair encoding works.


First, we initialize the vocabulary of symbols as all the English lowercase characters, a special
end-of-word symbol '_', and a special unknown symbol '[UNK]'.


import collections

symbols = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
           '_', '[UNK]']


Since we do not consider symbol pairs that cross boundaries of words, we only need a dictionary
raw_token_freqs that maps words to their frequencies (number of occurrences) in a dataset. Note
that the special symbol '_' is appended to each word so that we can easily recover a word sequence
(e.g., “a taller man”) from a sequence of output symbols (e.g., “a_ tall er_ man”). Since we start
the merging process from a vocabulary of only single characters and special symbols, a space is
inserted between every pair of consecutive characters within each word (keys of the dictionary
token_freqs). In other words, space is the delimiter between symbols within a word.
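The recovery of a word sequence from output symbols mentioned above can be sketched as follows. The helper name symbols_to_text is hypothetical; it simply concatenates the symbols and turns each end-of-word marker back into a space.

```python
def symbols_to_text(symbols):
    """Join output symbols and turn each end-of-word marker '_' back into a space."""
    return ''.join(symbols).replace('_', ' ').strip()

print(symbols_to_text('a_ tall er_ man'.split()))  # a taller man
```

Without the end-of-word marker, the symbols "tall" and "er" could not be told apart from two separate words, so the marker is what makes the segmentation reversible.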


raw_token_freqs = {'fast_': 4, 'faster_': 3, 'tall_': 5, 'taller_': 4}
token_freqs = {}
for token, freq in raw_token_freqs.items():
    token_freqs[' '.join(list(token))] = raw_token_freqs[token]
token_freqs


{'f a s t _': 4, 'f a s t e r _': 3, 't a l l _': 5, 't a l l e r _': 4}


We define the following get_max_freq_pair function that returns the most frequent pair of con- 
secutive symbols within a word, where words come from keys of the input dictionary token_freqs. 


def get_max_freq_pair(token_freqs):
    pairs = collections.defaultdict(int)
    for token, freq in token_freqs.items():
        symbols = token.split()
        for i in range(len(symbols) - 1):
            # Key of `pairs` is a tuple of two consecutive symbols
            pairs[symbols[i], symbols[i + 1]] += freq
    return max(pairs, key=pairs.get)  # Key of `pairs` with the max value


As a greedy approach based on frequency of consecutive symbols, byte pair encoding will use 
the following merge_symbols function to merge the most frequent pair of consecutive symbols to 
produce new symbols. 


def merge_symbols(max_freq_pair, token_freqs, symbols):
    symbols.append(''.join(max_freq_pair))
    new_token_freqs = dict()
    for token, freq in token_freqs.items():
        new_token = token.replace(' '.join(max_freq_pair),
                                  ''.join(max_freq_pair))
        new_token_freqs[new_token] = token_freqs[token]
    return new_token_freqs


Now we iteratively perform the byte pair encoding algorithm over the keys of the dictionary
token_freqs. In the first iteration, the most frequent pair of consecutive symbols are 't' and 'a',
thus byte pair encoding merges them to produce a new symbol 'ta'. In the second iteration, byte
pair encoding continues to merge 'ta' and 'l' to result in another new symbol 'tal'.


num_merges = 10
for i in range(num_merges):
    max_freq_pair = get_max_freq_pair(token_freqs)
    token_freqs = merge_symbols(max_freq_pair, token_freqs, symbols)
    print(f'merge #{i + 1}:', max_freq_pair)


merge #1: ('t', 'a')
merge #2: ('ta', 'l')
merge #3: ('tal', 'l')
merge #4: ('f', 'a')
merge #5: ('fa', 's')
merge #6: ('fas', 't')
merge #7: ('e', 'r')
merge #8: ('er', '_')
merge #9: ('tall', '_')
merge #10: ('fast', '_')
After 10 iterations of byte pair encoding, we can see that list symbols now contains 10 more
symbols that are iteratively merged from other symbols.


print(symbols) 


['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
 'u', 'v', 'w', 'x', 'y', 'z', '_', '[UNK]', 'ta', 'tal', 'tall', 'fa', 'fas', 'fast', 'er', 'er_',
 'tall_', 'fast_']


For the same dataset specified in the keys of the dictionary raw_token_freqs, each word in the 
dataset is now segmented by subwords “fast_”, “fast”, “er_”, “tall_”, and “tall” as a result of the byte 
pair encoding algorithm. For instance, words “faster_” and “taller_” are segmented as “fast er_” 
and “tall er_”, respectively. 


print(list(token_freqs.keys()))


['fast_', 'fast er_', 'tall_', 'tall er_']


Note that the result of byte pair encoding depends on the dataset being used. We can also use the 
subwords learned from one dataset to segment words of another dataset. As a greedy approach, 
the following segment_BPE function tries to break words into the longest possible subwords from 
the input argument symbols. 


def segment_BPE(tokens, symbols):
    outputs = []
    for token in tokens:
        start, end = 0, len(token)
        cur_output = []
        # Segment token with the longest possible subwords from symbols
        while start < len(token) and start < end:
            if token[start: end] in symbols:
                cur_output.append(token[start: end])
                start = end
                end = len(token)
            else:
                end -= 1
        if start < len(token):
            cur_output.append('[UNK]')
        outputs.append(' '.join(cur_output))
    return outputs


In the following, we use the subwords in list symbols, which is learned from the aforementioned 
dataset, to segment tokens that represent another dataset. 


tokens = ['tallest_', 'fatter_'] 
print(segment_BPE(tokens, symbols)) 


['tall e s t _', 'fa t t er_']
Summary


• FastText proposes a subword embedding method. Based on the skip-gram model in
  word2vec, it represents the central word vector as the sum of the subword vectors of the
  word.


• Subword embedding utilizes the principles of morphology, which usually improves the
  quality of representations of uncommon words.


• Byte pair encoding performs a statistical analysis of the training dataset to discover common
  symbols within a word. As a greedy approach, byte pair encoding iteratively merges the most
  frequent pair of consecutive symbols.


Exercises 


1. When there are too many subwords (for example, subwords of 6 letters in English result in about
   $3 \times 10^8$ combinations), what problems arise? Can you think of any methods to solve them?
   Hint: Refer to the end of section 3.2 of the fastText paper (Bojanowski et al., 2017).


2. How can you design a subword embedding model based on the continuous bag-of-words 
model? 


3. To get a vocabulary of size m, how many merging operations are needed when the initial 
symbol vocabulary size is n? 


4. How can we extend the idea of byte pair encoding to extract phrases? 


Discussions


14.7 Finding Synonyms and Analogies 


In Section 14.4 we trained a word2vec word embedding model on a small-scale dataset and 
searched for synonyms using the cosine similarity of word vectors. In practice, word vectors pre- 
trained on a large-scale corpus can often be applied to downstream natural language processing 
tasks. This section will demonstrate how to use these pretrained word vectors to find synonyms 
and analogies. We will continue to apply pretrained word vectors in subsequent sections. 


from d2l import mxnet as d2l
from mxnet import np, npx
import os

npx.set_np()





202 https://discuss.d2l.ai/t/386
14.7.1 Using Pretrained Word Vectors


Below we list pretrained GloVe embeddings of dimensions 50, 100, and 300, which can be
downloaded from the GloVe website (https://nlp.stanford.edu/projects/glove/). The pretrained
fastText embeddings are available in multiple languages. Here we consider one English version
(300-dimensional “wiki.en”) that can be downloaded from the fastText website
(https://fasttext.cc/).


#@save
d2l.DATA_HUB['glove.6b.50d'] = (d2l.DATA_URL + 'glove.6B.50d.zip',
                                '0b8703943ccdb6eb788e6f091b8946e82231bc4d')

#@save
d2l.DATA_HUB['glove.6b.100d'] = (d2l.DATA_URL + 'glove.6B.100d.zip',
                                 'cd43bfb07e44e6f27cbcc7bc9ae3d80284fdaf5a')

#@save
d2l.DATA_HUB['glove.42b.300d'] = (d2l.DATA_URL + 'glove.42B.300d.zip',
                                  'b5116e234e9eb9076672cfeabf5469f3eec904fa')

#@save
d2l.DATA_HUB['wiki.en'] = (d2l.DATA_URL + 'wiki.en.zip',
                           'c1816da3821ae9f43899be655002f6c723e91b88')


We define the following TokenEmbedding class to load the above pretrained GloVe and fastText
embeddings.


#@save
class TokenEmbedding:
    """Token Embedding."""
    def __init__(self, embedding_name):
        self.idx_to_token, self.idx_to_vec = self._load_embedding(
            embedding_name)
        self.unknown_idx = 0
        self.token_to_idx = {token: idx for idx, token in
                             enumerate(self.idx_to_token)}

    def _load_embedding(self, embedding_name):
        idx_to_token, idx_to_vec = ['<unk>'], []
        data_dir = d2l.download_extract(embedding_name)
        # GloVe website: https://nlp.stanford.edu/projects/glove/
        # fastText website: https://fasttext.cc/
        with open(os.path.join(data_dir, 'vec.txt'), 'r') as f:
            for line in f:
                elems = line.rstrip().split(' ')
                token, elems = elems[0], [float(elem) for elem in elems[1:]]
                # Skip header information, such as the top row in fastText
                if len(elems) > 1:
                    idx_to_token.append(token)
                    idx_to_vec.append(elems)
        idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec
        return idx_to_token, np.array(idx_to_vec)

    def __getitem__(self, tokens):
        indices = [self.token_to_idx.get(token, self.unknown_idx)
                   for token in tokens]
        vecs = self.idx_to_vec[np.array(indices)]
        return vecs

    def __len__(self):
        return len(self.idx_to_token)
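
To make the text format read by _load_embedding concrete, the following minimal sketch parses a few in-memory lines instead of a downloaded file. The helper name parse_embedding_lines and the three-dimensional sample vectors are made up for illustration; only the parsing logic mirrors the class above.

```python
def parse_embedding_lines(lines):
    """Parse 'token v1 v2 ...' lines; index 0 is reserved for '<unk>'."""
    idx_to_token, idx_to_vec = ['<unk>'], []
    for line in lines:
        elems = line.rstrip().split(' ')
        token, values = elems[0], [float(e) for e in elems[1:]]
        # A header row such as fastText's "<count> <dim>" leaves only one
        # value after the first field, so it is skipped here
        if len(values) > 1:
            idx_to_token.append(token)
            idx_to_vec.append(values)
    # The unknown token gets an all-zero vector of the same dimension
    idx_to_vec = [[0.0] * len(idx_to_vec[0])] + idx_to_vec
    return idx_to_token, idx_to_vec

# Hypothetical 3-dimensional sample with a fastText-style header row
sample = ['2 3\n', 'hello 0.1 0.2 0.3\n', 'world 0.4 0.5 0.6\n']
idx_to_token, idx_to_vec = parse_embedding_lines(sample)
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
print(idx_to_token)
# A token that is not in the vocabulary falls back to index 0 ('<unk>')
print(idx_to_vec[token_to_idx.get('missing', 0)])
```

Reserving index 0 for the unknown token means that any out-of-vocabulary query maps to the zero vector rather than raising an error.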


Next, we use 50-dimensional GloVe embeddings pretrained on a subset of Wikipedia. The
corresponding word embedding is automatically downloaded the first time we create a pretrained
word embedding instance.


glove_6b50d = TokenEmbedding('glove.6b.50d')


Downloading ../data/glove.6B.50d.zip from http://d2l-data.s3-accelerate.amazonaws.com/glove.6B.50d.zip...


Output the dictionary size. The dictionary contains 400,000 words and a special unknown token.


len(glove_6b50d) 


400001 


We can use a word to get its index in the dictionary, or we can get the word from its index. 


glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367]


(3367, 'beautiful')


14.7.2 Applying Pretrained Word Vectors 


Below, we demonstrate the application of pretrained word vectors, using GloVe as an example. 


Finding Synonyms 


Here, we re-implement the algorithm used to search for synonyms by cosine similarity introduced 
in Section 14.1 


In order to reuse the logic for seeking the k nearest neighbors when seeking analogies, we encap- 
sulate this part of the logic separately in the knn (k-nearest neighbors) function. 


def knn(W, x, k):
    # The added 1e-9 is for numerical stability
    cos = np.dot(W, x.reshape(-1,)) / (
        np.sqrt(np.sum(W * W, axis=1) + 1e-9) * np.sqrt((x * x).sum()))
    topk = npx.topk(cos, k=k, ret_typ='indices')
    return topk, [cos[int(i)] for i in topk]
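
The same computation can be traced by hand on a tiny example. The sketch below is a plain-Python stand-in for knn (the name cosine_knn and the toy vectors are ours, not part of the d2l package); it shows why the 1e-9 term matters for the all-zero row that represents the unknown token.

```python
import math

def cosine_knn(W, x, k):
    """Return indices and cosine similarities of the k rows of W closest to x."""
    x_norm = math.sqrt(sum(v * v for v in x))
    sims = []
    for row in W:
        dot = sum(a * b for a, b in zip(row, x))
        # The added 1e-9 keeps the division finite for all-zero rows
        row_norm = math.sqrt(sum(a * a for a in row) + 1e-9)
        sims.append(dot / (row_norm * x_norm))
    order = sorted(range(len(W)), key=lambda i: sims[i], reverse=True)[:k]
    return order, [sims[i] for i in order]

# Hypothetical toy vectors: row 0 plays the role of the all-zero '<unk>' row
W = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
topk, cos = cosine_knn(W, [1.0, 0.0], 2)
print(topk)  # [1, 2]
```

Without the small constant, the zero row would produce a division by zero; with it, the row simply receives a similarity of 0.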
Then, we search for synonyms using the pretrained word vector instance embed.


def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec, embed[[query_token]], k + 1)
    for i, c in zip(topk[1:], cos[1:]):  # Remove input words
        print(f'cosine sim={float(c):.3f}: {embed.idx_to_token[int(i)]}')


The dictionary of pretrained word vector instance glove_6b50d already created contains 400,000 
words and a special unknown token. Excluding input words and unknown words, we search for 
the three words that are the most similar in meaning to “chip”. 


get_similar_tokens('chip', 3, glove_6b50d) 


cosine sim=0.856: chips 
cosine sim=0.749: intel 
cosine sim=0.749: electronics 


Next, we search for the synonyms of “baby” and “beautiful”. 


get_similar_tokens('baby', 3, glove_6b50d)


cosine sim=0.839: babies 
cosine sim=0.800: boy 
cosine sim=0.792: girl 


get_similar_tokens('beautiful', 3, glove_6b50d) 


cosine sim=0.921: lovely 
cosine sim=0.893: gorgeous 
cosine sim=0.830: wonderful 


Finding Analogies 


In addition to seeking synonyms, we can also use pretrained word vectors to seek analogies
between words. For example, “man”:“woman”::“son”:“daughter” is an analogy: “man” is to
“woman” as “son” is to “daughter”. The problem of seeking analogies can be defined as follows:
for four words in the analogical relationship a : b :: c : d, given the first three words,
a, b and c, we want to find d. Assume the word vector for the word w is vec(w). To solve the
analogy problem, we need to find the word vector that is most similar to the result vector of


vec(c) + vec(b) - vec(a).


def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed[[token_a, token_b, token_c]]
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.idx_to_vec, x, 1)
    return embed.idx_to_token[int(topk[0])]  # Remove unknown words
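
The vector arithmetic behind get_analogy can be checked on hand-picked two-dimensional vectors. Everything in this sketch (the toy_embed dictionary, its values, and the helper names) is invented for illustration and is unrelated to the actual GloVe vectors.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)) + 1e-9)

# Hypothetical embeddings chosen so that the offset woman - man equals
# the offset daughter - son
toy_embed = {'man': [1.0, 0.0], 'woman': [1.0, 1.0],
             'son': [2.0, 0.0], 'daughter': [2.0, 1.0]}

def toy_analogy(a, b, c, embed):
    # Target vector: vec(c) + vec(b) - vec(a)
    x = [vc + vb - va for va, vb, vc in zip(embed[a], embed[b], embed[c])]
    # Return the token whose vector has the largest cosine similarity to x
    return max(embed, key=lambda token: cosine(embed[token], x))

print(toy_analogy('man', 'woman', 'son', toy_embed))  # daughter
```

Like get_analogy above, this sketch does not exclude the three input words from the candidates, so with real embeddings the nearest neighbor may occasionally be one of the query words.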


Verify the “male-female” analogy. 
get_analogy('man', 'woman', 'son', glove_6b50d)


'daughter'


“Capital-country” analogy: “beijing” is to “china” as “tokyo” is to what? The answer should be
“japan”.


get_analogy('beijing', 'china', 'tokyo', glove_6b50d)


'japan'


“Adjective-superlative adjective” analogy: “bad” is to “worst” as “big” is to what? The answer 
should be “biggest”. 


get_analogy('bad', 'worst', 'big', glove_6b50d)


'biggest' 


“Present tense verb-past tense verb” analogy: “do” is to “did” as “go” is to what? The answer should 
be “went”. 


get_analogy('do', 'did', 'go', glove_6b50d)


'went'


Summary 


• Word vectors pretrained on a large-scale corpus can often be applied to downstream natural
language processing tasks.


• We can use pretrained word vectors to seek synonyms and analogies.


Exercises 


1. Test the fastText results using TokenEmbedding('wiki.en'). 


2. If the dictionary is extremely large, how can we accelerate finding synonyms and analogies?


Discussions
14.8 Bidirectional Encoder Representations from Transformers (BERT)


We have introduced several word embedding models for natural language understanding. Af- 
ter pretraining, the output can be thought of as a matrix where each row is a vector that repre- 
sents a word of a predefined vocabulary. In fact, these word embedding models are all context- 
independent. Let us begin by illustrating this property. 


14.8.1 From Context-Independent to Context-Sensitive 


Recall the experiments in Section 14.4 and Section 14.7. For instance, word2vec and GloVe both 
assign the same pretrained vector to the same word regardless of the context of the word (if any). 
Formally, a context-independent representation of any token x is a function f(x) that only takes 
x as its input. Given the abundance of polysemy and complex semantics in natural languages, 
context-independent representations have obvious limitations. For instance, the word “crane” in 
contexts “a crane is flying” and “a crane driver came” has completely different meanings; thus, 
the same word may be assigned different representations depending on contexts. 


This motivates the development of context-sensitive word representations, where representations 
of words depend on their contexts. Hence, a context-sensitive representation of token x is a
function f(x, c(x)) depending on both x and its context c(x). Popular context-sensitive representations
include TagLM (language-model-augmented sequence tagger) (Peters et al., 2017b), CoVe (Con- 
text Vectors) (McCann et al., 2017), and ELMo (Embeddings from Language Models) (Peters et al., 
2018). 


For example, by taking the entire sequence as the input, ELMo is a function that assigns a rep- 
resentation to each word from the input sequence. Specifically, ELMo combines all the inter- 
mediate layer representations from pretrained bidirectional LSTM as the output representation. 
Then the ELMo representation will be added to a downstream task’s existing supervised model as 
additional features, such as by concatenating ELMo representation and the original representa- 
tion (e.g., GloVe) of tokens in the existing model. On one hand, all the weights in the pretrained 
bidirectional LSTM model are frozen after ELMo representations are added. On the other hand, 
the existing supervised model is specifically customized for a given task. Leveraging different 
best models for different tasks at that time, adding ELMo improved the state of the art across six 
natural language processing tasks: sentiment analysis, natural language inference, semantic role 
labeling, coreference resolution, named entity recognition, and question answering. 


14.8.2 From Task-Specific to Task-Agnostic 


Although ELMo has significantly improved solutions to a diverse set of natural language process- 
ing tasks, each solution still hinges on a task-specific architecture. However, it is practically non- 
trivial to craft a specific architecture for every natural language processing task. The GPT (Gen- 
erative Pre-Training) model represents an effort in designing a general task-agnostic model for 
context-sensitive representations (Radford et al., 2018). Built on a Transformer decoder, GPT pre- 
trains a language model that will be used to represent text sequences. When applying GPT to a 
downstream task, the output of the language model will be fed into an added linear output layer to 
predict the label of the task. In sharp contrast to ELMo that freezes parameters of the pretrained 
model, GPT fine-tunes all the parameters in the pretrained Transformer decoder during super- 
vised learning of the downstream task. GPT was evaluated on twelve tasks of natural language 
inference, question answering, sentence similarity, and classification, and improved the state of 
the art in nine of them with minimal changes to the model architecture. 







However, due to the autoregressive nature of language models, GPT only looks forward (left-to- 
right). In contexts “i went to the bank to deposit cash” and “i went to the bank to sit down”, as 
“bank” is sensitive to the context to its left, GPT will return the same representation for “bank”, 
though it has different meanings. 


14.8.3 BERT: Combining the Best of Both Worlds 


As we have seen, ELMo encodes context bidirectionally but uses task-specific architectures; while 
GPT is task-agnostic but encodes context left-to-right. Combining the best of both worlds, BERT 
(Bidirectional Encoder Representations from Transformers) encodes context bidirectionally and 
requires minimal architecture changes for a wide range of natural language processing tasks (De- 
vlin et al., 2018). Using a pretrained Transformer encoder, BERT is able to represent any token 
based on its bidirectional context. During supervised learning of downstream tasks, BERT is sim- 
ilar to GPT in two aspects. First, BERT representations will be fed into an added output layer, 
with minimal changes to the model architecture depending on nature of tasks, such as predict- 
ing for every token vs. predicting for the entire sequence. Second, all the parameters of the pre- 
trained Transformer encoder are fine-tuned, while the additional output layer will be trained from 
scratch. Fig. 14.8.1 depicts the differences among ELMo, GPT, and BERT. 


[Figure: three panels labeled ELMo, GPT, and BERT. Each takes Token 1, ..., Token N as input and
outputs the label(s) of the task. Panel annotations: “Architecture crafted for the given task”,
“Pretraining”, and “Pretraining & fine-tuning”.]


Fig. 14.8.1: A comparison of ELMo, GPT, and BERT.


BERT further improved the state of the art on eleven natural language processing tasks under
broad categories of i) single text classification (e.g., sentiment analysis), ii) text pair
classification (e.g., natural language inference), iii) question answering, and iv) text tagging
(e.g., named entity recognition). All proposed in 2018, from context-sensitive ELMo to
task-agnostic GPT and BERT, conceptually simple yet empirically powerful pretraining of deep
representations for natural languages has revolutionized solutions to various natural language
processing tasks.


In the rest of this chapter, we will dive into the pretraining of BERT. When natural language pro- 
cessing applications are explained in Chapter 15, we will illustrate fine-tuning of BERT for down- 
stream applications. 
from d2l import mxnet as d2l
from mxnet import gluon, np, npx
from mxnet.gluon import nn

npx.set_np()


14.8.4 Input Representation 


In natural language processing, some tasks (e.g., sentiment analysis) take single text as the input, 
while in some other tasks (e.g., natural language inference), the input is a pair of text sequences. 
The BERT input sequence unambiguously represents both single text and text pairs. In the former, 
the BERT input sequence is the concatenation of the special classification token “<cls>”, tokens of 
a text sequence, and the special separation token “<sep>”. In the latter, the BERT input sequence 
is the concatenation of “<cls>”, tokens of the first text sequence, “<sep>”, tokens of the second text 
sequence, and “<sep>”. We will consistently distinguish the terminology “BERT input sequence” 
from other types of “sequences”. For instance, one BERT input sequence may include either one 
text sequence or two text sequences. 


To distinguish text pairs, the learned segment embeddings e_A and e_B are added to the token
embeddings of the first sequence and the second sequence, respectively. For single text inputs,
only e_A is used.


The following get_tokens_and_segments takes either one sentence or two sentences as the input, 
then returns tokens of the BERT input sequence and their corresponding segment IDs. 


#@save
def get_tokens_and_segments(tokens_a, tokens_b=None):
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # 0 and 1 are marking segment A and B, respectively
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments
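
As a quick check, the function can be applied to a small text pair. The sketch below repeats the definition so that it runs on its own; the two example sentences are chosen only for illustration.

```python
def get_tokens_and_segments(tokens_a, tokens_b=None):
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # Segment 0 covers '<cls>', the first sequence, and the first '<sep>'
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        # Segment 1 covers the second sequence and the final '<sep>'
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments

tokens, segments = get_tokens_and_segments(
    ['this', 'movie', 'is', 'great'], ['i', 'like', 'it'])
print(tokens)
# ['<cls>', 'this', 'movie', 'is', 'great', '<sep>', 'i', 'like', 'it', '<sep>']
print(segments)
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

Note that each “<sep>” token is assigned to the segment of the text sequence that precedes it.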


BERT chooses the Transformer encoder as its bidirectional architecture. Common in the Trans- 
former encoder, positional embeddings are added at every position of the BERT input sequence. 
However, different from the original Transformer encoder, BERT uses learnable positional em- 
beddings. To sum up, Fig. 14.8.2 shows that the embeddings of the BERT input sequence are the 
sum of the token embeddings, segment embeddings, and positional embeddings. 
[Figure: for the input “<cls> this movie is great <sep> i like it <sep>”, the token embedding,
the segment embedding (e_A for the first six positions, e_B for the last four), and the
positional embedding at each position are added together.]

Fig. 14.8.2: The embeddings of the BERT input sequence are the sum of the token embeddings, 
segment embeddings, and positional embeddings. 
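
The summation can be spelled out with plain Python lists. This is only a sketch under assumed toy sizes: the three lookup tables are filled with random numbers in place of learned parameters, and the names random_table and embed are ours.

```python
import random

random.seed(0)
num_hiddens, vocab_size, max_len = 4, 10, 8

def random_table(num_rows, num_cols):
    return [[random.gauss(0, 1) for _ in range(num_cols)]
            for _ in range(num_rows)]

token_table = random_table(vocab_size, num_hiddens)   # one row per token id
segment_table = random_table(2, num_hiddens)          # rows for e_A and e_B
position_table = random_table(max_len, num_hiddens)   # one row per position

def embed(token_ids, segment_ids):
    # Elementwise sum of the three embeddings at every position
    return [[t + s + p for t, s, p in zip(token_table[tok],
                                          segment_table[seg],
                                          position_table[pos])]
            for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))]

X = embed([1, 5, 2, 7, 3], [0, 0, 0, 1, 1])
print(len(X), len(X[0]))  # 5 4
```

Because the three embeddings share the same dimension, adding them leaves the shape of the input representation unchanged.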


The following BERTEncoder class is similar to the TransformerEncoder class as implemented in Sec- 
tion 10.7. Different from TransformerEncoder, BERTEncoder uses segment embeddings and learn- 
able positional embeddings. 


#@save
class BERTEncoder(nn.Block):
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                 num_layers, dropout, max_len=1000, **kwargs):
        super(BERTEncoder, self).__init__(**kwargs)
        self.token_embedding = nn.Embedding(vocab_size, num_hiddens)
        self.segment_embedding = nn.Embedding(2, num_hiddens)
        self.blks = nn.Sequential()
        for _ in range(num_layers):
            self.blks.add(d2l.EncoderBlock(
                num_hiddens, ffn_num_hiddens, num_heads, dropout, True))
        # In BERT, positional embeddings are learnable, thus we create a
        # parameter of positional embeddings that are long enough
        self.pos_embedding = self.params.get('pos_embedding',
                                             shape=(1, max_len, num_hiddens))

    def forward(self, tokens, segments, valid_lens):
        # Shape of `X` remains unchanged in the following code snippet:
        # (batch size, max sequence length, `num_hiddens`)
        X = self.token_embedding(tokens) + self.segment_embedding(segments)
        X = X + self.pos_embedding.data(ctx=X.ctx)[:, :X.shape[1], :]
        for blk in self.blks:
            X = blk(X, valid_lens)
        return X


Suppose that the vocabulary size is 10,000. To demonstrate forward inference of BERTEncoder, let 
us create an instance of it and initialize its parameters. 


vocab_size, num_hiddens, ffn_num_hiddens, num_heads = 10000, 768, 1024, 4
num_layers, dropout = 2, 0.2
encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                      num_layers, dropout)
encoder.initialize()


We define tokens to be 2 BERT input sequences of length 8, where each token is an index of the
vocabulary. The forward inference of BERTEncoder with the input tokens returns the encoded result
where each token is represented by a vector whose length is predefined by the hyperparameter
num_hiddens. This hyperparameter is usually referred to as the hidden size (number of hidden
units) of the Transformer encoder.


tokens = np.random.randint(0, vocab_size, (2, 8))
segments = np.array([[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]])
encoded_X = encoder(tokens, segments, None)
encoded_X.shape


(2, 8, 768)


14.8.5 Pretraining Tasks 


The forward inference of BERTEncoder gives the BERT representation of each token of the input 
text and the inserted special tokens “<cls>” and “<sep>”. Next, we will use these representations
to compute the loss function for pretraining BERT. The pretraining is composed of the following 
two tasks: masked language modeling and next sentence prediction. 


Masked Language Modeling 


As illustrated in Section 8.3, a language model predicts a token using the context on its left. To 
encode context bidirectionally for representing each token, BERT randomly masks tokens and 
uses tokens from the bidirectional context to predict the masked tokens. This task is referred to 
as a masked language model. 


In this pretraining task, 15% of tokens will be selected at random as the masked tokens for predic- 
tion. To predict a masked token without cheating by using the label, one straightforward approach 
is to always replace it with a special “<mask>” token in the BERT input sequence. However, the 
artificial special token “<mask>” will never appear in fine-tuning. To avoid such a mismatch be- 
tween pretraining and fine-tuning, if a token is masked for prediction (e.g., “great” is selected to 
be masked and predicted in “this movie is great”), in the input it will be replaced with: 


• a special “<mask>” token for 80% of the time (e.g., “this movie is great” becomes “this movie
is <mask>”);


• a random token for 10% of the time (e.g., “this movie is great” becomes “this movie is drink”);


• the unchanged label token for 10% of the time (e.g., “this movie is great” becomes “this
movie is great”).


Note that a random token is inserted only 10% of 15% of the time, i.e., at 1.5% of all token
positions. This occasional noise encourages BERT to be less biased towards the masked token
(especially when the label token remains unchanged) in its bidirectional context encoding.
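
The proportions above can be multiplied out to see how often each kind of replacement occurs relative to all tokens. The following arithmetic sketch uses our own variable names and only the percentages stated in the text.

```python
select_rate = 0.15  # fraction of tokens selected for prediction
replacement_shares = {'<mask>': 0.8, 'random token': 0.1, 'unchanged': 0.1}

# Overall rate of each outcome = selection rate * share among selected tokens
overall = {kind: select_rate * share
           for kind, share in replacement_shares.items()}
for kind, rate in overall.items():
    print(f'{kind}: {rate:.1%} of all tokens')
```

In other words, about 12% of all tokens are replaced with “<mask>”, 1.5% are replaced with a random token, and 1.5% are selected for prediction but left unchanged.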


We implement the following MaskLM class to predict masked tokens in the masked language model 
task of BERT pretraining. The prediction uses a one-hidden-layer MLP (self.mlp). In forward 
inference, it takes two inputs: the encoded result of BERTEncoder and the token positions for pre- 
diction. The output is the prediction results at these positions. 
#@save
class MaskLM(nn.Block):
    def __init__(self, vocab_size, num_hiddens, **kwargs):
        super(MaskLM, self).__init__(**kwargs)
        self.mlp = nn.Sequential()
        self.mlp.add(
            nn.Dense(num_hiddens, flatten=False, activation='relu'))
        self.mlp.add(nn.LayerNorm())
        self.mlp.add(nn.Dense(vocab_size, flatten=False))

    def forward(self, X, pred_positions):
        num_pred_positions = pred_positions.shape[1]
        pred_positions = pred_positions.reshape(-1)
        batch_size = X.shape[0]
        batch_idx = np.arange(0, batch_size)
        # Suppose that `batch_size` = 2, `num_pred_positions` = 3, then
        # `batch_idx` is `np.array([0, 0, 0, 1, 1, 1])`
        batch_idx = np.repeat(batch_idx, num_pred_positions)
        masked_X = X[batch_idx, pred_positions]
        masked_X = masked_X.reshape((batch_size, num_pred_positions, -1))
        mlm_Y_hat = self.mlp(masked_X)
        return mlm_Y_hat
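
The indexing step in forward is the least obvious part of MaskLM. The sketch below reproduces it with plain Python lists, using string labels in place of encoded vectors so that the selected entries are easy to read; the variable names other than batch_idx and pred_positions are ours.

```python
batch_size, seq_len, num_pred_positions = 2, 8, 3
# Stand-in for encoded representations: X[b][t] is a label, not a vector
X = [[f'b{b}t{t}' for t in range(seq_len)] for b in range(batch_size)]
pred_positions = [[1, 5, 2], [6, 1, 5]]

flat_positions = [p for row in pred_positions for p in row]
# Repeat each batch index once per prediction: [0, 0, 0, 1, 1, 1]
batch_idx = [b for b in range(batch_size) for _ in range(num_pred_positions)]
# Pairwise indexing picks X[batch_idx[i]][flat_positions[i]]
masked_flat = [X[b][p] for b, p in zip(batch_idx, flat_positions)]
# Regroup into (batch size, number of predictions)
masked_X = [masked_flat[i * num_pred_positions:(i + 1) * num_pred_positions]
            for i in range(batch_size)]
print(masked_X)
# [['b0t1', 'b0t5', 'b0t2'], ['b1t6', 'b1t1', 'b1t5']]
```

Each row of the result contains only the positions to predict for that BERT input sequence, which is what the MLP then maps to vocabulary-sized scores.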


To demonstrate the forward inference of MaskLM, we create its instance mlm and initialize it. Recall 
that encoded_X from the forward inference of BERTEncoder represents 2 BERT input sequences. 
We define mlm_positions as the 3 indices to predict in either BERT input sequence of encoded_X. 
The forward inference of mlm returns prediction results mlm_Y_hat at all the masked positions 
mlm_positions of encoded_X. For each prediction, the size of the result is equal to the vocabulary 
size. 


mlm = MaskLM(vocab_size, num_hiddens)
mlm.initialize()
mlm_positions = np.array([[1, 5, 2], [6, 1, 5]])
mlm_Y_hat = mlm(encoded_X, mlm_positions)
mlm_Y_hat.shape


(2, 3, 10000) 


With the ground truth labels mlm_Y of the predicted tokens mlm_Y_hat under masks, we can calcu- 
late the cross entropy loss of the masked language model task in BERT pretraining. 


mlm_Y = np.array([[7, 8, 9], [10, 20, 30]])
loss = gluon.loss.SoftmaxCrossEntropyLoss()
mlm_l = loss(mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y.reshape(-1))
mlm_l.shape


(6,) 
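

To see what each of these per-prediction losses measures, here is a minimal plain-Python sketch of softmax cross-entropy. The four-token vocabulary and the scores are hypothetical, and the helper softmax_cross_entropy is ours rather than the Gluon implementation.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of one prediction: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Hypothetical scores over a 4-token vocabulary for 3 masked positions
toy_Y_hat = [[2.0, 0.5, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0], [0.1, 0.2, 3.0, 0.3]]
toy_Y = [0, 3, 2]
losses = [softmax_cross_entropy(z, y) for z, y in zip(toy_Y_hat, toy_Y)]
print(len(losses))          # 3: one loss per masked position
print(round(losses[1], 3))  # 1.386, i.e., log(4) for uniform scores
```

A prediction with uniform scores over V tokens always has loss log V, which is a useful reference point when monitoring pretraining.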
Next Sentence Prediction


Although masked language modeling is able to encode bidirectional context for representing 
words, it does not explicitly model the logical relationship between text pairs. To help under- 
stand the relationship between two text sequences, BERT considers a binary classification task, 
next sentence prediction, in its pretraining. When generating sentence pairs for pretraining, for 
half of the time they are indeed consecutive sentences with the label “True”; while for the other 
half of the time the second sentence is randomly sampled from the corpus with the label “False”. 


The following NextSentencePred class uses a one-hidden-layer MLP to predict whether the second
sentence is the next sentence of the first in the BERT input sequence. Due to self-attention in the
Transformer encoder, the BERT representation of the special token “<cls>” encodes both
sentences from the input. Hence, the output layer (self.output) of the MLP classifier takes X as
the input, where X is the output of the MLP hidden layer whose input is the encoded “<cls>” token.


#@save
class NextSentencePred(nn.Block):
    def __init__(self, **kwargs):
        super(NextSentencePred, self).__init__(**kwargs)
        self.output = nn.Dense(2)

    def forward(self, X):
        # `X` shape: (batch size, `num_hiddens`)
        return self.output(X)


We can see that the forward inference of a NextSentencePred instance returns binary predictions
for each BERT input sequence. 


nsp = NextSentencePred() 
nsp.initialize() 

nsp_Y_hat = nsp(encoded_X) 
nsp_Y_hat.shape 


(2, 2)


The cross-entropy loss of the 2 binary classifications can also be computed.


nsp_y = np.array([0, 1])
nsp_l = loss(nsp_Y_hat, nsp_y)
nsp_l.shape


(2,) 


It is noteworthy that all the labels in both the aforementioned pretraining tasks can be trivially 
obtained from the pretraining corpus without manual labeling effort. The original BERT has been 
pretrained on the concatenation of BookCorpus (Zhu et al., 2015) and English Wikipedia. These 
two text corpora are huge: they have 800 million words and 2.5 billion words, respectively. 
14.8.6 Putting All Things Together


When pretraining BERT, the final loss function is a linear combination of both the loss functions 
for masked language modeling and next sentence prediction. Now we can define the BERTModel 
class by instantiating the three classes BERTEncoder, MaskLM, and NextSentencePred. The forward 
inference returns the encoded BERT representations encoded_X, predictions of masked language 
modeling mlm_Y_hat, and next sentence predictions nsp_Y_hat. 


#@save
class BERTModel(nn.Block):
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                 num_layers, dropout, max_len=1000):
        super(BERTModel, self).__init__()
        self.encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens,
                                   num_heads, num_layers, dropout, max_len)
        self.hidden = nn.Dense(num_hiddens, activation='tanh')
        self.mlm = MaskLM(vocab_size, num_hiddens)
        self.nsp = NextSentencePred()

    def forward(self, tokens, segments, valid_lens=None, pred_positions=None):
        encoded_X = self.encoder(tokens, segments, valid_lens)
        if pred_positions is not None:
            mlm_Y_hat = self.mlm(encoded_X, pred_positions)
        else:
            mlm_Y_hat = None
        # The hidden layer of the MLP classifier for next sentence prediction.
        # 0 is the index of the '<cls>' token
        nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :]))
        return encoded_X, mlm_Y_hat, nsp_Y_hat
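

One possible way to form the linear combination of the two losses is sketched below. It is an illustration under our own assumptions, not the training code of this book: both coefficients are set to 1, and the hypothetical mlm_weights list marks which prediction slots are real (1) and which are padding (0).

```python
def pretraining_loss(mlm_losses, mlm_weights, nsp_losses):
    # Average the masked language modeling loss over real predictions only
    # (weight 1); padded prediction slots (weight 0) are ignored
    num_real = max(sum(mlm_weights), 1)
    mlm_l = sum(l * w for l, w in zip(mlm_losses, mlm_weights)) / num_real
    # Average the next sentence prediction loss over the batch
    nsp_l = sum(nsp_losses) / len(nsp_losses)
    # Linear combination with both coefficients set to 1 (an assumption)
    return mlm_l + nsp_l

# Made-up loss values: the third prediction slot is padding and is ignored
print(pretraining_loss([2.0, 4.0, 9.0], [1, 1, 0], [0.5, 1.5]))  # 4.0
```

Averaging each loss before summing keeps the two terms on a comparable scale regardless of how many tokens are masked per sequence.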


Summary 


• Word embedding models such as word2vec and GloVe are context-independent. They assign
the same pretrained vector to the same word regardless of the context of the word (if any).
It is hard for them to handle polysemy or complex semantics in natural languages well.


• For context-sensitive word representations such as ELMo and GPT, representations of words
depend on their contexts.


• ELMo encodes context bidirectionally but uses task-specific architectures (however, it is
practically non-trivial to craft a specific architecture for every natural language processing
task); while GPT is task-agnostic but encodes context left-to-right.


• BERT combines the best of both worlds: it encodes context bidirectionally and requires
minimal architecture changes for a wide range of natural language processing tasks.


• The embeddings of the BERT input sequence are the sum of the token embeddings, segment
embeddings, and positional embeddings.


• Pretraining BERT is composed of two tasks: masked language modeling and next sentence
prediction. The former is able to encode bidirectional context for representing words, while
the latter explicitly models the logical relationship between text pairs.
Exercises


1. Why does BERT succeed? 


2. All other things being equal, will a masked language model require more or fewer pretrain- 
ing steps to converge than a left-to-right language model? Why? 


3. In the original implementation of BERT, the positionwise feed-forward network in
BERTEncoder (via d2l.EncoderBlock) and the fully-connected layer in MaskLM both use the Gaussian
error linear unit (GELU) (Hendrycks & Gimpel, 2016) as the activation function. Research
into the difference between GELU and ReLU.


Discussions


14.9 The Dataset for Pretraining BERT 


To pretrain the BERT model as implemented in Section 14.8, we need to generate the dataset in the
ideal format to facilitate the two pretraining tasks: masked language modeling and next sentence
prediction. On one hand, the original BERT model is pretrained on the concatenation of two huge
corpora, BookCorpus and English Wikipedia (see Section 14.8.5), making it hard to run for most
readers of this book. On the other hand, the off-the-shelf pretrained BERT model may not fit
applications from specific domains like medicine. Thus, it is getting popular to pretrain BERT on a
customized dataset. To facilitate the demonstration of BERT pretraining, we use a smaller corpus,
WikiText-2 (Merity et al., 2016).


Compared with the PTB dataset used for pretraining word2vec in Section 14.3, WikiText-2 i) retains
the original punctuation, making it suitable for next sentence prediction; ii) retains the
original case and numbers; iii) is more than twice as large.


from d2l import mxnet as d2l
from mxnet import gluon, np, npx
import os
import random

npx.set_np()


In the WikiText-2 dataset, each line represents a paragraph where space is inserted between any 
punctuation and its preceding token. Paragraphs with at least two sentences are retained. To split 
sentences, we only use the period as the delimiter for simplicity. We leave discussions of more
complex sentence splitting techniques to the exercises at the end of this section.


#@save
d2l.DATA_HUB['wikitext-2'] = (
    'https://s3.amazonaws.com/research.metamind.io/wikitext/'
    'wikitext-2-v1.zip', '3c914d17d80b1459be871a5039ac23e752a53cbe')


#@save
def _read_wiki(data_dir):
    file_name = os.path.join(data_dir, 'wiki.train.tokens')
    with open(file_name, 'r') as f:
        lines = f.readlines()
    # Uppercase letters are converted to lowercase ones
    paragraphs = [line.strip().lower().split(' . ')
                  for line in lines if len(line.split(' . ')) >= 2]
    random.shuffle(paragraphs)
    return paragraphs
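

The effect of the period-based splitting can be seen on a few made-up lines. This sketch applies the same list comprehension as _read_wiki to an in-memory sample (the sample text is ours) and omits the shuffling so that the result is deterministic.

```python
# Hypothetical lines in the WikiText-2 style, with a space before punctuation
sample_lines = [
    'The crane is flying . It is white . \n',
    ' = Heading line without a sentence break = \n',
    'BERT was proposed in 2018 . It uses a Transformer encoder .\n',
]
# Keep only lines that split into at least two pieces, then lowercase them
paragraphs = [line.strip().lower().split(' . ')
              for line in sample_lines if len(line.split(' . ')) >= 2]
for paragraph in paragraphs:
    print(paragraph)
# ['the crane is flying', 'it is white .']
# ['bert was proposed in 2018', 'it uses a transformer encoder .']
```

Two side effects of this simple rule are visible here: the heading line is dropped because it contains no “ . ” delimiter, and the last sentence of each paragraph keeps its trailing period.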


14.9.1 Defining Helper Functions for Pretraining Tasks 


In the following, we begin by implementing helper functions for the two BERT pretraining tasks: 
next sentence prediction and masked language modeling. These helper functions will be invoked 
later when transforming the raw text corpus into the dataset of the ideal format to pretrain BERT. 


Generating the Next Sentence Prediction Task 


According to descriptions of Section 14.8.5, the _get_next_sentence function generates a training 
example for the binary classification task. 


#@save
def _get_next_sentence(sentence, next_sentence, paragraphs):
    if random.random() < 0.5:
        is_next = True
    else:
        # `paragraphs` is a list of lists of lists
        next_sentence = random.choice(random.choice(paragraphs))
        is_next = False
    return sentence, next_sentence, is_next


The following function generates training examples for next sentence prediction from the input 
paragraph by invoking the _get_next_sentence function. Here paragraph is a list of sentences, 
where each sentence is a list of tokens. The argument max_len specifies the maximum length of a 
BERT input sequence during pretraining. 


#@save
def _get_nsp_data_from_paragraph(paragraph, paragraphs, vocab, max_len):
    nsp_data_from_paragraph = []
    for i in range(len(paragraph) - 1):
        tokens_a, tokens_b, is_next = _get_next_sentence(
            paragraph[i], paragraph[i + 1], paragraphs)
        # Consider 1 '<cls>' token and 2 '<sep>' tokens
        if len(tokens_a) + len(tokens_b) + 3 > max_len:
            continue
        tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b)
        nsp_data_from_paragraph.append((tokens, segments, is_next))
    return nsp_data_from_paragraph
Generating the Masked Language Modeling Task


In order to generate training examples for the masked language modeling task from a BERT input 
sequence, we define the following _replace_mlm_tokens function. In its inputs, tokens is a list of 
tokens representing a BERT input sequence, candidate_pred_positions is a list of token indices 
of the BERT input sequence excluding those of special tokens (special tokens are not predicted 
in the masked language modeling task), and num_mlm_preds indicates the number of predictions 
(recall 15% random tokens to predict). Following the definition of the masked language modeling 
task in Section 14.8.5, at each prediction position, the input may be replaced by a special “<mask>” 
token or a random token, or remain unchanged. In the end, the function returns the input tokens 
after possible replacement, the token indices where predictions take place and labels for these 
predictions. 


#@save
def _replace_mlm_tokens(tokens, candidate_pred_positions, num_mlm_preds,
                        vocab):
    # Make a new copy of tokens for the input of a masked language model,
    # where the input may contain replaced '<mask>' or random tokens
    mlm_input_tokens = [token for token in tokens]
    pred_positions_and_labels = []
    # Shuffle for getting 15% random tokens for prediction in the masked
    # language modeling task
    random.shuffle(candidate_pred_positions)
    for mlm_pred_position in candidate_pred_positions:
        if len(pred_positions_and_labels) >= num_mlm_preds:
            break
        masked_token = None
        # 80% of the time: replace the word with the '<mask>' token
        if random.random() < 0.8:
            masked_token = '<mask>'
        else:
            # 10% of the time: keep the word unchanged
            if random.random() < 0.5:
                masked_token = tokens[mlm_pred_position]
            # 10% of the time: replace the word with a random word
            else:
                # Draw a random token string rather than an integer index,
                # because `mlm_input_tokens` is a list of tokens
                masked_token = random.choice(vocab.idx_to_token)
        mlm_input_tokens[mlm_pred_position] = masked_token
        pred_positions_and_labels.append(
            (mlm_pred_position, tokens[mlm_pred_position]))
    return mlm_input_tokens, pred_positions_and_labels
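The two-stage random draw described above (first 80% versus 20%, then an even split of the remaining 20%) can be checked empirically. The following standalone sketch is not part of the d2l library; the helper name `sample_mlm_action` and the fixed seed are assumptions made only for illustration. It repeats the draw many times and reports how often each action is chosen, which should come out close to 80%, 10%, and 10%.

```python
import random
from collections import Counter


def sample_mlm_action(rng):
    """Return the action chosen for one prediction position (hypothetical helper)."""
    # 80% of the time: replace with '<mask>'
    if rng.random() < 0.8:
        return 'mask'
    # The remaining 20% is split evenly between keeping the token
    # and replacing it with a random token
    if rng.random() < 0.5:
        return 'keep'
    return 'random'


rng = random.Random(0)  # fixed seed so that the run is repeatable
num_trials = 100_000
counts = Counter(sample_mlm_action(rng) for _ in range(num_trials))
freqs = {action: count / num_trials for action, count in counts.items()}
print(freqs)
```

Because the second draw is only reached in the 20% branch, each of its two outcomes has an overall probability of 0.2 × 0.5 = 0.1.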


By invoking the aforementioned _replace_mlm_tokens function, the following function takes a 
BERT input sequence (tokens) as an input and returns indices of the input tokens (after possible 
token replacement as described in Section 14.8.5), the token indices where predictions take place, 
and label indices for these predictions. 


#@save
def _get_mlm_data_from_tokens(tokens, vocab):
    candidate_pred_positions = []
    # `tokens` is a list of strings
    for i, token in enumerate(tokens):
        # Special tokens are not predicted in the masked language modeling
        # task
        if token in ['<cls>', '<sep>']:
            continue
        candidate_pred_positions.append(i)
    # 15% of random tokens are predicted in the masked language modeling task
    num_mlm_preds = max(1, round(len(tokens) * 0.15))
    mlm_input_tokens, pred_positions_and_labels = _replace_mlm_tokens(
        tokens, candidate_pred_positions, num_mlm_preds, vocab)
    pred_positions_and_labels = sorted(pred_positions_and_labels,
                                       key=lambda x: x[0])
    pred_positions = [v[0] for v in pred_positions_and_labels]
    mlm_pred_labels = [v[1] for v in pred_positions_and_labels]
    return vocab[mlm_input_tokens], pred_positions, vocab[mlm_pred_labels]
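To make the bookkeeping concrete, the sketch below reproduces only the candidate selection and the prediction budget for a short toy sequence. It is a simplified stand-in: the helper name `mlm_candidates_and_budget` is hypothetical, and no token replacement or vocabulary lookup is performed.

```python
def mlm_candidates_and_budget(tokens, special_tokens=('<cls>', '<sep>'),
                              ratio=0.15):
    """Return candidate prediction positions and the number of predictions."""
    # Special tokens are never predicted, so their positions are skipped
    candidates = [i for i, token in enumerate(tokens)
                  if token not in special_tokens]
    # At least one position is always predicted, even for short sequences
    num_preds = max(1, round(len(tokens) * ratio))
    return candidates, num_preds


toy_tokens = ['<cls>', 'a', 'crane', 'is', 'flying', '<sep>']
toy_candidates, toy_num_preds = mlm_candidates_and_budget(toy_tokens)
print(toy_candidates, toy_num_preds)
```

For this six-token sequence, 6 × 0.15 = 0.9 rounds to 1, so a single position among the four non-special tokens would be selected for prediction.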


14.9.2 Transforming Text into the Pretraining Dataset 


Now we are almost ready to customize a Dataset class for pretraining BERT. Before that, we
still need to define a helper function _pad_bert_inputs to append the special “<pad>” tokens
to the inputs. Its argument examples contains the outputs from the helper functions
_get_nsp_data_from_paragraph and _get_mlm_data_from_tokens for the two pretraining tasks.


#@save
def _pad_bert_inputs(examples, max_len, vocab):
    max_num_mlm_preds = round(max_len * 0.15)
    all_token_ids, all_segments, valid_lens = [], [], []
    all_pred_positions, all_mlm_weights, all_mlm_labels = [], [], []
    nsp_labels = []
    for (token_ids, pred_positions, mlm_pred_label_ids, segments,
         is_next) in examples:
        all_token_ids.append(np.array(token_ids + [vocab['<pad>']] * (
            max_len - len(token_ids)), dtype='int32'))
        all_segments.append(np.array(segments + [0] * (
            max_len - len(segments)), dtype='int32'))
        # `valid_lens` excludes count of '<pad>' tokens
        valid_lens.append(np.array(len(token_ids), dtype='float32'))
        all_pred_positions.append(np.array(pred_positions + [0] * (
            max_num_mlm_preds - len(pred_positions)), dtype='int32'))
        # Predictions of padded tokens will be filtered out in the loss via
        # multiplication of 0 weights
        all_mlm_weights.append(
            np.array([1.0] * len(mlm_pred_label_ids) + [0.0] * (
                max_num_mlm_preds - len(pred_positions)), dtype='float32'))
        all_mlm_labels.append(np.array(mlm_pred_label_ids + [0] * (
            max_num_mlm_preds - len(mlm_pred_label_ids)), dtype='int32'))
        nsp_labels.append(np.array(is_next))
    return (all_token_ids, all_segments, valid_lens, all_pred_positions,
            all_mlm_weights, all_mlm_labels, nsp_labels)
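The padding logic can be followed more easily on a single example with plain Python lists. The sketch below is an illustration under simplifying assumptions: `pad_example` is a hypothetical helper, the padding index is taken to be 0, and segments and next-sentence labels are left out.

```python
def pad_example(token_ids, pred_positions, mlm_label_ids, max_len, pad_id=0):
    """Pad one example to fixed sizes (hypothetical, list-based sketch)."""
    max_num_mlm_preds = round(max_len * 0.15)
    # Pad the token indices up to `max_len`
    padded_ids = token_ids + [pad_id] * (max_len - len(token_ids))
    # The valid length counts only the real (non-padding) tokens
    valid_len = len(token_ids)
    num_missing = max_num_mlm_preds - len(pred_positions)
    # Padded prediction slots point at position 0 and carry weight 0.0,
    # so they contribute nothing to a weighted loss
    padded_positions = pred_positions + [0] * num_missing
    mlm_weights = [1.0] * len(mlm_label_ids) + [0.0] * num_missing
    padded_labels = mlm_label_ids + [0] * (
        max_num_mlm_preds - len(mlm_label_ids))
    return padded_ids, valid_len, padded_positions, mlm_weights, padded_labels


padded = pad_example(token_ids=[3, 11, 12, 13, 14, 4], pred_positions=[2],
                     mlm_label_ids=[12], max_len=20)
for field in padded:
    print(field)
```

With max_len set to 20, there are round(20 × 0.15) = 3 prediction slots, of which only the first is a real prediction in this example.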


Putting together the helper functions for generating training examples of the two pretraining
tasks and the helper function for padding inputs, we customize the following _WikiTextDataset
class as the WikiText-2 dataset for pretraining BERT. By implementing the __getitem__ function,
we can arbitrarily access the pretraining (masked language modeling and next sentence prediction)
examples generated from a pair of sentences from the WikiText-2 corpus.

The original BERT model uses WordPiece embeddings whose vocabulary size is 30,000 (Wu et al.,
2016). The tokenization method of WordPiece is a slight modification of the original byte pair
encoding algorithm in Section 14.6.2. For simplicity, we use the d2l.tokenize function for
tokenization. Infrequent tokens that appear fewer than five times are filtered out.
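The effect of a minimum-frequency threshold together with reserved tokens can be sketched in a few lines. This is only an illustration: `build_token_to_idx` is a hypothetical helper and is not claimed to match how d2l.Vocab is implemented, and the toy corpus uses a threshold of 2 instead of 5 so that the filtering is visible.

```python
from collections import Counter


def build_token_to_idx(sentences, min_freq, reserved_tokens):
    """Build a token-to-index map that drops infrequent tokens (sketch)."""
    counter = Counter(token for sentence in sentences for token in sentence)
    # Assumption of this sketch: index 0 is the unknown token, followed by
    # the reserved tokens
    idx_to_token = ['<unk>'] + list(reserved_tokens)
    # Keep tokens seen at least `min_freq` times, most frequent first
    for token, freq in sorted(counter.items(), key=lambda x: (-x[1], x[0])):
        if freq >= min_freq and token not in idx_to_token:
            idx_to_token.append(token)
    return {token: idx for idx, token in enumerate(idx_to_token)}


toy_sentences = [['the', 'crane', 'flew'], ['the', 'crane', 'left'],
                 ['the', 'end']]
toy_token_to_idx = build_token_to_idx(
    toy_sentences, min_freq=2,
    reserved_tokens=['<pad>', '<mask>', '<cls>', '<sep>'])
print(toy_token_to_idx)
```

Tokens that fall below the threshold ('flew', 'left', and 'end' here) receive no index of their own and would be mapped to the unknown token at lookup time.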


#@save
class _WikiTextDataset(gluon.data.Dataset):
    def __init__(self, paragraphs, max_len):
        # Input `paragraphs[i]` is a list of sentence strings representing a
        # paragraph; while output `paragraphs[i]` is a list of sentences
        # representing a paragraph, where each sentence is a list of tokens
        paragraphs = [
            d2l.tokenize(paragraph, token='word') for paragraph in paragraphs
        ]
        sentences = [
            sentence for paragraph in paragraphs for sentence in paragraph
        ]
        self.vocab = d2l.Vocab(
            sentences,
            min_freq=5,
            reserved_tokens=['<pad>', '<mask>', '<cls>', '<sep>'])
        # Get data for the next sentence prediction task
        examples = []
        for paragraph in paragraphs:
            examples.extend(
                _get_nsp_data_from_paragraph(paragraph, paragraphs, self.vocab,
                                             max_len))
        # Get data for the masked language model task
        examples = [(_get_mlm_data_from_tokens(tokens, self.vocab) +
                     (segments, is_next))
                    for tokens, segments, is_next in examples]
        # Pad inputs
        (self.all_token_ids, self.all_segments, self.valid_lens,
         self.all_pred_positions, self.all_mlm_weights, self.all_mlm_labels,
         self.nsp_labels) = _pad_bert_inputs(examples, max_len, self.vocab)

    def __getitem__(self, idx):
        return (self.all_token_ids[idx], self.all_segments[idx],
                self.valid_lens[idx], self.all_pred_positions[idx],
                self.all_mlm_weights[idx], self.all_mlm_labels[idx],
                self.nsp_labels[idx])

    def __len__(self):
        return len(self.all_token_ids)


By using the _read_wiki function and the _WikiTextDataset class, we define the following
load_data_wiki to download the WikiText-2 dataset and generate pretraining examples from it.


#@save
def load_data_wiki(batch_size, max_len):
    num_workers = d2l.get_dataloader_workers()
    data_dir = d2l.download_extract('wikitext-2', 'wikitext-2')
    paragraphs = _read_wiki(data_dir)
    train_set = _WikiTextDataset(paragraphs, max_len)
    train_iter = gluon.data.DataLoader(train_set, batch_size, shuffle=True,
                                       num_workers=num_workers)
    return train_iter, train_set.vocab


Setting the batch size to 512 and the maximum length of a BERT input sequence to be 64, we
print out the shapes of a minibatch of BERT pretraining examples. Note that in each BERT input
sequence, 10 (64 × 0.15 = 9.6, rounded) positions are predicted for the masked language modeling task.


batch_size, max_len = 512, 64 
train_iter, vocab = load_data_wiki(batch_size, max_len) 


for (tokens_X, segments_X, valid_lens_x, pred_positions_X, mlm_weights_X,
     mlm_Y, nsp_y) in train_iter:
    print(tokens_X.shape, segments_X.shape, valid_lens_x.shape,
          pred_positions_X.shape, mlm_weights_X.shape, mlm_Y.shape,
          nsp_y.shape)
    break


(512, 64) (512, 64) (512,) (512, 10) (512, 10) (512, 10) (512,) 


In the end, let us take a look at the vocabulary size. Even after filtering out infrequent tokens, it is
still more than twice as large as that of the PTB dataset.


len(vocab) 


20256 


Summary 


• Compared with the PTB dataset, the WikiText-2 dataset retains the original punctuation,
case, and numbers, and is more than twice as large.


• We can arbitrarily access the pretraining (masked language modeling and next sentence
prediction) examples generated from a pair of sentences from the WikiText-2 corpus.


Exercises 


1. For simplicity, the period is used as the only delimiter for splitting sentences. Try other
sentence splitting techniques, such as spaCy and NLTK. Take NLTK as an example. You need
to install NLTK first: pip install nltk. In the code, first import nltk. Then, download the
Punkt sentence tokenizer: nltk.download('punkt'). To split sentences such as sentences
= 'This is great ! Why not ?', invoking nltk.tokenize.sent_tokenize(sentences) will
return a list of two sentence strings: ['This is great !', 'Why not ?'].


2. What is the vocabulary size if we do not filter out any infrequent token? 


Discussions [206]


[206] https://discuss.d2l.ai/t/389
14.10 Pretraining BERT


With the BERT model implemented in Section 14.8 and the pretraining examples generated from 
the WikiText-2 dataset in Section 14.9, we will pretrain BERT on the WikiText-2 dataset in this 
section. 


from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx

npx.set_np()


To start, we load the WikiText-2 dataset as minibatches of pretraining examples for masked
language modeling and next sentence prediction. The batch size is 512 and the maximum length of
a BERT input sequence is 64. Note that in the original BERT model, the maximum length is 512.


batch_size, max_len = 512, 64 
train_iter, vocab = d2l.load_data_wiki(batch_size, max_len)


Downloading ../data/wikitext-2-v1.zip from https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip...


14.10.1 Pretraining BERT 


The original BERT has two versions of different model sizes (Devlin et al., 2018). The base model
(BERT_BASE) uses 12 layers (Transformer encoder blocks) with 768 hidden units (hidden size) and
12 self-attention heads. The large model (BERT_LARGE) uses 24 layers with 1024 hidden units and
16 self-attention heads. Notably, the former has 110 million parameters while the latter has 340
million parameters. For ease of demonstration, we define a small BERT, using 2 layers, 128
hidden units, and 2 self-attention heads.


net = d2l.BERTModel(len(vocab), num_hiddens=128, ffn_num_hiddens=256,
                    num_heads=2, num_layers=2, dropout=0.2)
devices = d2l.try_all_gpus()
net.initialize(init.Xavier(), ctx=devices)
loss = gluon.loss.SoftmaxCELoss()


Before defining the training loop, we define a helper function _get_batch_loss_bert. Given the
shard of training examples, this function computes the loss for both the masked language
modeling and next sentence prediction tasks. Note that the final loss of BERT pretraining is just
the sum of both the masked language modeling loss and the next sentence prediction loss.


#@save
def _get_batch_loss_bert(net, loss, vocab_size, tokens_X_shards,
                         segments_X_shards, valid_lens_x_shards,
                         pred_positions_X_shards, mlm_weights_X_shards,
                         mlm_Y_shards, nsp_y_shards):
    mlm_ls, nsp_ls, ls = [], [], []
    for (tokens_X_shard, segments_X_shard, valid_lens_x_shard,
         pred_positions_X_shard, mlm_weights_X_shard, mlm_Y_shard,
         nsp_y_shard) in zip(
             tokens_X_shards, segments_X_shards, valid_lens_x_shards,
             pred_positions_X_shards, mlm_weights_X_shards, mlm_Y_shards,
             nsp_y_shards):
        # Forward pass
        _, mlm_Y_hat, nsp_Y_hat = net(
            tokens_X_shard, segments_X_shard, valid_lens_x_shard.reshape(-1),
            pred_positions_X_shard)
        # Compute masked language model loss
        mlm_l = loss(
            mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y_shard.reshape(-1),
            mlm_weights_X_shard.reshape((-1, 1)))
        mlm_l = mlm_l.sum() / (mlm_weights_X_shard.sum() + 1e-8)
        # Compute next sentence prediction loss
        nsp_l = loss(nsp_Y_hat, nsp_y_shard)
        nsp_l = nsp_l.mean()
        mlm_ls.append(mlm_l)
        nsp_ls.append(nsp_l)
        ls.append(mlm_l + nsp_l)
        npx.waitall()
    return mlm_ls, nsp_ls, ls
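The normalization of the masked language modeling loss is worth spelling out: per-position losses are multiplied by their weights, summed, and divided by the number of real predictions rather than by the number of slots. The following sketch shows this with plain Python numbers; `masked_mean` is a hypothetical helper and the loss values are made up for illustration.

```python
def masked_mean(per_position_losses, weights, eps=1e-8):
    """Average the losses over positions whose weight is nonzero (sketch)."""
    # Padded prediction slots have weight 0.0 and drop out of the sum
    weighted_sum = sum(l * w for l, w in zip(per_position_losses, weights))
    # `eps` guards against division by zero when every weight is 0.0
    return weighted_sum / (sum(weights) + eps)


toy_losses = [2.0, 4.0, 9.0]   # the last slot is a padded prediction
toy_weights = [1.0, 1.0, 0.0]
toy_mlm_loss = masked_mean(toy_losses, toy_weights)
print(round(toy_mlm_loss, 6))
```

Dividing by the sum of the weights keeps the loss comparable across sequences with different numbers of real predictions; a plain mean over all three slots would instead be diluted by the padded slot.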


Invoking the aforementioned helper function, the following train_bert function defines the
procedure to pretrain BERT (net) on the WikiText-2 (train_iter) dataset. Training BERT can take
very long. Instead of specifying the number of epochs for training as in the train_ch13 function
(see Section 13.1), the input num_steps of the following function specifies the number of
iteration steps for training.


def train_bert(train_iter, net, loss, vocab_size, devices, num_steps):
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': 1e-3})
    step, timer = 0, d2l.Timer()
    animator = d2l.Animator(xlabel='step', ylabel='loss',
                            xlim=[1, num_steps], legend=['mlm', 'nsp'])
    # Sum of masked language modeling losses, sum of next sentence prediction
    # losses, no. of sentence pairs, count
    metric = d2l.Accumulator(4)
    num_steps_reached = False
    while step < num_steps and not num_steps_reached:
        for batch in train_iter:
            (tokens_X_shards, segments_X_shards, valid_lens_x_shards,
             pred_positions_X_shards, mlm_weights_X_shards,
             mlm_Y_shards, nsp_y_shards) = [gluon.utils.split_and_load(
                 elem, devices, even_split=False) for elem in batch]
            timer.start()
            with autograd.record():
                mlm_ls, nsp_ls, ls = _get_batch_loss_bert(
                    net, loss, vocab_size, tokens_X_shards, segments_X_shards,
                    valid_lens_x_shards, pred_positions_X_shards,
                    mlm_weights_X_shards, mlm_Y_shards, nsp_y_shards)
            for l in ls:
                l.backward()
            trainer.step(1)
            mlm_l_mean = sum([float(l) for l in mlm_ls]) / len(mlm_ls)
            nsp_l_mean = sum([float(l) for l in nsp_ls]) / len(nsp_ls)
            metric.add(mlm_l_mean, nsp_l_mean, batch[0].shape[0], 1)
            timer.stop()
            animator.add(step + 1,
                         (metric[0] / metric[3], metric[1] / metric[3]))
            step += 1
            if step == num_steps:
                num_steps_reached = True
                break

    print(f'MLM loss {metric[0] / metric[3]:.3f}, '
          f'NSP loss {metric[1] / metric[3]:.3f}')
    print(f'{metric[2] / timer.sum():.1f} sentence pairs/sec on '
          f'{str(devices)}')
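The control flow that stops training after a fixed number of steps, restarting the data iterator whenever it is exhausted, can be isolated from the model code. The sketch below is a framework-free illustration; `iterate_steps` is a hypothetical generator, and the strings stand in for minibatches.

```python
def iterate_steps(make_iter, num_steps):
    """Yield (step, batch) pairs until `num_steps` batches have been seen."""
    step = 0
    while step < num_steps:
        num_seen = 0
        # Each pass over `make_iter()` corresponds to one epoch
        for batch in make_iter():
            yield step, batch
            step += 1
            num_seen += 1
            if step == num_steps:
                return
        if num_seen == 0:
            # An empty dataset would otherwise loop forever
            return


toy_batches = ['b0', 'b1', 'b2']
visited = [batch for _, batch in iterate_steps(lambda: iter(toy_batches), 7)]
print(visited)
```

With three batches per epoch and seven steps, the iterator is restarted twice and training stops in the middle of the third pass.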


We can plot both the masked language modeling loss and the next sentence prediction loss during 
BERT pretraining. 


train_bert(train_iter, net, loss, len(vocab), devices, 50) 


MLM loss 7.895, NSP loss 0.726 
7860.6 sentence pairs/sec on [gpu(0), gpu(1)] 


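[Figure: masked language modeling (mlm) and next sentence prediction (nsp) losses plotted against the training step (x-axis: step; y-axis: loss).]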
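One way to put the two reported losses on a common footing is to compare each with the cross-entropy of a uniform guess, which equals the natural logarithm of the number of classes. The sketch below assumes that the losses are natural-log cross-entropies and uses the vocabulary size of 20256 printed in Section 14.9; it is a back-of-the-envelope baseline, not an analysis of the training run.

```python
import math

vocab_size = 20256  # vocabulary size reported for the WikiText-2 dataset
num_nsp_classes = 2  # next sentence prediction is a binary task

# Cross-entropy of a uniform prediction over n classes is ln(n)
mlm_uniform_loss = math.log(vocab_size)
nsp_uniform_loss = math.log(num_nsp_classes)
print(f'uniform-guess MLM loss: {mlm_uniform_loss:.3f}')
print(f'uniform-guess NSP loss: {nsp_uniform_loss:.3f}')
```

Under these assumptions, the two tasks start from very different baselines simply because one chooses among tens of thousands of tokens and the other between two labels.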


14.10.2 Representing Text with BERT 


After pretraining BERT, we can use it to represent single text, text pairs, or any token in them. The 
following function returns the BERT (net) representations for all tokens in tokens_a and tokens_b. 


def get_bert_encoding(net, tokens_a, tokens_b=None):
    tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b)
    token_ids = np.expand_dims(np.array(vocab[tokens], ctx=devices[0]),
                               axis=0)
    segments = np.expand_dims(np.array(segments, ctx=devices[0]), axis=0)
    valid_len = np.expand_dims(np.array(len(tokens), ctx=devices[0]), axis=0)
    encoded_X, _, _ = net(token_ids, segments, valid_len)
    return encoded_X
Consider the sentence “a crane is flying”. Recall the input representation of BERT as discussed in
Section 14.8.4. After inserting the special tokens “<cls>” (used for classification) and “<sep>” (used
for separation), the BERT input sequence has a length of six. Since zero is the index of the “<cls>”
token, encoded_text[:, 0, :] is the BERT representation of the entire input sentence. To evaluate
the polysemous token “crane”, we also print out the first three elements of the BERT representation
of the token.


tokens_a = ['a', 'crane', 'is', 'flying']
encoded_text = get_bert_encoding(net, tokens_a)
# Tokens: '<cls>', 'a', 'crane', 'is', 'flying', '<sep>'
encoded_text_cls = encoded_text[:, 0, :]
encoded_text_crane = encoded_text[:, 2, :]
encoded_text.shape, encoded_text_cls.shape, encoded_text_crane[0][:3]


((1, 6, 128),
 (1, 128),
 array([ 1.449986 ,  1.0014055, -0.8294296], ctx=gpu(0)))


Now consider a sentence pair “a crane driver came” and “he just left”. Similarly, encoded_pair[:,
0, :] is the encoded result of the entire sentence pair from the pretrained BERT. Note that the
first three elements of the polysemous token “crane” are different from those when the context is
different. This supports the claim that BERT representations are context-sensitive.


tokens_a, tokens_b = ['a', 'crane', 'driver', 'came'], ['he', 'just', 'left']
encoded_pair = get_bert_encoding(net, tokens_a, tokens_b)
# Tokens: '<cls>', 'a', 'crane', 'driver', 'came', '<sep>', 'he', 'just',
# 'left', '<sep>'
encoded_pair_cls = encoded_pair[:, 0, :]
encoded_pair_crane = encoded_pair[:, 2, :]
encoded_pair.shape, encoded_pair_cls.shape, encoded_pair_crane[0][:3]


((1, 10, 128),
 (1, 128),
 array([ 1.439362 ,  1.064286 , -0.8315817], ctx=gpu(0)))
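A simple way to quantify how much two representations differ is their cosine similarity. The sketch below applies it to the first three elements of the two “crane” representations printed in this section. Since only 3 of the 128 dimensions are available here, the result is merely an illustration of the computation, not a measurement of how far apart the full vectors are; `cosine_similarity` is a helper written for this sketch.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# First three elements of the "crane" representation in each context,
# copied from the outputs printed in this section
crane_in_sentence = [1.449986, 1.0014055, -0.8294296]
crane_in_pair = [1.439362, 1.064286, -0.8315817]
similarity = cosine_similarity(crane_in_sentence, crane_in_pair)
print(similarity)
```

A value of exactly 1 would mean that the two truncated vectors point in the same direction; any value below 1 reflects a difference introduced by the context.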


In Chapter 15, we will fine-tune a pretrained BERT model for downstream natural language
processing applications.


Summary 
• The original BERT has two versions, where the base model has 110 million parameters and
the large model has 340 million parameters.


• After pretraining BERT, we can use it to represent single text, text pairs, or any token in them.


• In the experiment, the same token has different BERT representations when its contexts
are different. This supports the claim that BERT representations are context-sensitive.
Exercises


1. In the experiment, we can see that the masked language modeling loss is significantly higher
than the next sentence prediction loss. Why?


2. Set the maximum length of a BERT input sequence to be 512 (same as the original BERT
model). Use the configurations of the original BERT model such as BERT_LARGE. Do you
encounter any error when running this section? Why?


Discussions [207]


[207] https://discuss.d2l.ai/t/390
15 Natural Language Processing: Applications


We have seen how to represent text tokens and train their representations in Chapter 14. Such 
pretrained text representations can be fed to various models for different downstream natural 
language processing tasks. 


This book does not intend to cover natural language processing applications in a comprehensive
manner. Our focus is on how to apply (deep) representation learning of languages to address natural
language processing problems. Nonetheless, we have already discussed several natural language
processing applications without pretraining in earlier chapters, just for explaining deep learning
architectures. For instance, in Chapter 8, we have relied on RNNs to design language models to
generate novella-like text. In Chapter 9 and Chapter 10, we have also designed models based on
RNNs and attention mechanisms for machine translation. Given pretrained text representations,
in this chapter, we will consider two more downstream natural language processing tasks:
sentiment analysis and natural language inference. These are popular and representative natural
language processing applications: the former analyzes single text and the latter analyzes
relationships of text pairs.




Fig. 15.1: Pretrained text representations can be fed to various deep learning architectures for 
different downstream natural language processing applications. This chapter focuses on how to 
design models for different downstream natural language processing applications. 


As depicted in Fig. 15.1, this chapter focuses on describing the basic ideas of designing natural 
language processing models using different types of deep learning architectures, such as MLPs, 
CNNs, RNNs, and attention. Though it is possible to combine any pretrained text representations 
with any architecture for either downstream natural language processing task in Fig. 15.1, we
select a few representative combinations. Specifically, we will explore popular architectures based 
on RNNs and CNNs for sentiment analysis. For natural language inference, we choose attention 
and MLPs to demonstrate how to analyze text pairs. In the end, we introduce how to fine-tune 
a pretrained BERT model for a wide range of natural language processing applications, such as 
on a sequence level (single text classification and text pair classification) and a token level (text 
tagging and question answering). As a concrete empirical case, we will fine-tune BERT for natural 
language processing. 


As we have introduced in Section 14.8, BERT requires minimal architecture changes for a wide
range of natural language processing applications. However, this benefit comes at the cost of
fine-tuning a huge number of BERT parameters for the downstream applications. When space or time
is limited, those crafted models based on MLPs, CNNs, RNNs, and attention are more feasible. In
the following, we start with the sentiment analysis application and illustrate the model design based
on RNNs and CNNs, respectively.


15.1 Sentiment Analysis and the Dataset 


Text classification is a common task in natural language processing, which maps a text sequence
of indefinite length to a category. It is similar to image classification, the most
frequently used application in this book, e.g., Section 18.9. The only difference is that each
example in text classification is a text sentence rather than an image.


This section will focus on loading data for one of the sub-questions in this field: using text
sentiment classification to analyze the emotions of the text's author. This problem is also called
sentiment analysis and has a wide range of applications. For example, we can analyze user reviews of
products to obtain user satisfaction statistics, or analyze user sentiments about market conditions
and use them to predict future trends.


from d2l import mxnet as d2l
from mxnet import np, npx
import os

npx.set_np()


15.1.1 The Sentiment Analysis Dataset 


We use Stanford's Large Movie Review Dataset [208] as the dataset for sentiment analysis. This
dataset is divided into two datasets for training and testing purposes, each containing 25,000 movie
reviews downloaded from IMDb. In each dataset, the number of comments labeled as “positive”
and “negative” is equal.


[208] https://ai.stanford.edu/~amaas/data/sentiment/
Reading the Dataset


We first download this dataset to the “../data” path and extract it to “../data/aclImdb”. 


#@save
d2l.DATA_HUB['aclImdb'] = (
    'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
    '01ada507287d82875905620988597833ad4e0903')

data_dir = d2l.download_extract('aclImdb', 'aclImdb')


Downloading ../data/aclImdb_v1.tar.gz from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz...


Next, read the training and test datasets. Each example is a review and its corresponding label: 1 
indicates “positive” and 0 indicates “negative”. 


#@save
def read_imdb(data_dir, is_train):
    data, labels = [], []
    for label in ('pos', 'neg'):
        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
                                   label)
        for file in os.listdir(folder_name):
            with open(os.path.join(folder_name, file), 'rb') as f:
                # Decode the raw bytes and strip newline characters
                review = f.read().decode('utf-8').replace('\n', '')
                data.append(review)
                labels.append(1 if label == 'pos' else 0)
    return data, labels


train_data = read_imdb(data_dir, is_train=True)
print('# trainings:', len(train_data[0]))
for x, y in zip(train_data[0][:3], train_data[1][:3]):
    print('label:', y, 'review:', x[0:60])


# trainings: 25000 

label: 1 review: Normally the best way to annoy me in a film is to include so 
label: 1 review: The Bible teaches us that the love of money is the root of a 
label: 1 review: Being someone who lists Night of the Living Dead at number t 


Tokenization and Vocabulary 


We use a word as a token, and then create a dictionary based on the training dataset. 


train_tokens = d2l.tokenize(train_data[0], token='word')
vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])

d2l.set_figsize()
d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50))
(array([ 553., 2373., 6820., 4834., 2817., 1848., 1380., 1005.,  759.,
        ...]),
 array([  0,  50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600,
        650, 700, 750, 800, 850, 900, 950]),
 <BarContainer object of 19 artists>)


[Figure: histogram of the number of tokens per review, in bins of 50 from 0 to 950.]
Padding to the Same Length


Because the reviews have different lengths, they cannot be directly combined into minibatches.
Here we fix the length of each comment to 500 by truncating or adding “<pad>” indices.


num_steps = 500  # sequence length
train_features = np.array([d2l.truncate_pad(
    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
print(train_features.shape)


(25000, 500) 
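The truncate-or-pad step can be written in a few lines of plain Python. The function below is a sketch of the behavior described in this section (cut sequences longer than num_steps, extend shorter ones with the padding index); it is written for illustration and is only assumed to behave like d2l.truncate_pad.

```python
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a sequence so that its length is exactly `num_steps`."""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad


print(truncate_pad([7, 8, 9], 5, 0))
print(truncate_pad([1, 2, 3, 4, 5, 6], 5, 0))
```

Note that truncation keeps the beginning of a review and discards its end, so any sentiment expressed only in the final sentences of a very long review is lost.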


Creating the Data Iterator 


Now, we will create a data iterator. Each iteration will return a minibatch of data.


train_iter = d2l.load_array((train_features, train_data[1]), 64)

for X, y in train_iter:
    print('X:', X.shape, ', y:', y.shape)
    break
print('# batches:', len(train_iter))


X: (64, 500) , y: (64,)
# batches: 391
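The number of minibatches follows directly from the dataset size and the batch size, assuming that the last, smaller minibatch is kept rather than dropped. A quick check of the arithmetic:

```python
import math

num_examples = 25000  # number of training reviews
batch_size = 64

# 390 full minibatches plus one final minibatch with the remaining examples
num_batches = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)
```

The final minibatch therefore holds fewer than 64 reviews, which is worth keeping in mind when code assumes a fixed batch dimension.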
15.1.2 Putting All Things Together


Last, we will save a function load_data_imdb into d2l, which returns the vocabulary and data
iterators.


#@save
def load_data_imdb(batch_size, num_steps=500):
    data_dir = d2l.download_extract('aclImdb', 'aclImdb')
    train_data = read_imdb(data_dir, True)
    test_data = read_imdb(data_dir, False)
    train_tokens = d2l.tokenize(train_data[0], token='word')
    test_tokens = d2l.tokenize(test_data[0], token='word')
    vocab = d2l.Vocab(train_tokens, min_freq=5)
    train_features = np.array([
        d2l.truncate_pad(vocab[line], num_steps, vocab['<pad>'])
        for line in train_tokens
    ])
    test_features = np.array([
        d2l.truncate_pad(vocab[line], num_steps, vocab['<pad>'])
        for line in test_tokens
    ])
    train_iter = d2l.load_array((train_features, train_data[1]), batch_size)
    test_iter = d2l.load_array((test_features, test_data[1]),
                               batch_size,
                               is_train=False)
    return train_iter, test_iter, vocab


Summary 


• Text classification can classify a text sequence into a category.


• To classify a text sentiment, we load an IMDb dataset and tokenize its words. Then we pad
the text sequence for short reviews and create a data iterator.


Exercises 


1. Discover a different natural language dataset (such as Amazon reviews [209]) and build a
data loader function similar to load_data_imdb.


Discussions [210]


[209] https://snap.stanford.edu/data/web-Amazon.html
[210] https://discuss.d2l.ai/t/391
15.2 Sentiment Analysis: Using Recurrent Neural Networks


Similar to searching for synonyms and analogies, text classification is also a downstream application
of word embedding. In this section, we will apply pre-trained word vectors (GloVe) and bidirectional
recurrent neural networks with multiple hidden layers (Maas et al., 2011), as shown in Fig. 15.2.1.
We will use the model to determine whether a text sequence of indefinite length contains positive
or negative emotion.


Fig. 15.2.1: This section feeds pretrained GloVe to an RNN-based architecture for sentiment analysis.


from d2l import mxnet as d2l
from mxnet import gluon, init, np, npx
from mxnet.gluon import nn, rnn

npx.set_np()

batch_size = 64
train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)


15.2.1 Using a Recurrent Neural Network Model 


In this model, each word first obtains a feature vector from the embedding layer. Then, we further
encode the feature sequence using a bidirectional recurrent neural network to obtain sequence
information. Finally, we transform the encoded sequence information to output through the fully
connected layer. Specifically, we can concatenate the hidden states of the bidirectional long
short-term memory at the initial time step and the final time step and pass the result to the output
layer as the encoded feature sequence information for classification. In the BiRNN class implemented
below, the Embedding instance is the embedding layer, the LSTM instance is the hidden layer for
sequence encoding, and the Dense instance is the output layer that generates the classification results.


class BiRNN(nn.Block):
    def __init__(self, vocab_size, embed_size, num_hiddens,
                 num_layers, **kwargs):
        super(BiRNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Set `bidirectional` to True to get a bidirectional recurrent neural
        # network
        self.encoder = rnn.LSTM(num_hiddens, num_layers=num_layers,
                                bidirectional=True, input_size=embed_size)
        self.decoder = nn.Dense(2)

    def forward(self, inputs):
        # The shape of `inputs` is (batch size, no. of words). Because LSTM
        # needs to use sequence as the first dimension, the input is
        # transformed and the word feature is then extracted. The output shape
        # is (no. of words, batch size, word vector dimension).
        embeddings = self.embedding(inputs.T)
        # Since the input (embeddings) is the only argument passed into
        # rnn.LSTM, it only returns the hidden states of the last hidden layer
        # at different time step (outputs). The shape of `outputs` is
        # (no. of words, batch size, 2 * no. of hidden units).
        outputs = self.encoder(embeddings)
        # Concatenate the hidden states of the initial time step and final
        # time step to use as the input of the fully connected layer. Its
        # shape is (batch size, 4 * no. of hidden units)
        encoding = np.concatenate((outputs[0], outputs[-1]), axis=1)
        outs = self.decoder(encoding)
        return outs


Create a bidirectional recurrent neural network with two hidden layers.


embed_size, num_hiddens, num_layers, devices = 100, 100, 2, d2l.try_all_gpus()
net = BiRNN(len(vocab), embed_size, num_hiddens, num_layers)
net.initialize(init.Xavier(), ctx=devices)


Loading Pre-trained Word Vectors 


Because the training dataset for sentiment classification is not very large, in order to deal with 
overfitting, we will directly use word vectors pre-trained on a larger corpus as the feature vectors 
of all words. Here, we load a 100-dimensional GloVe word vector for each word in the dictionary 
vocab. 


glove_embedding = d2l.TokenEmbedding('glove.6b.100d')


Downloading ../data/glove.6B.100d.zip from http://d2l-data.s3-accelerate.amazonaws.com/glove.6B.100d.zip...


Query the word vectors that are in our vocabulary.


embeds = glove_embedding[vocab.idx_to_token]
embeds.shape


(49346, 100) 
Then, we will use these word vectors as feature vectors for each word in the reviews. Note that
the dimensions of the pre-trained word vectors need to be consistent with the embedding layer 
output size embed_size in the created model. In addition, we no longer update these word vectors 
during training. 


net.embedding.weight.set_data(embeds)
# Setting `grad_req` to 'null' stops gradient updates for the embedding
net.embedding.collect_params().setattr('grad_req', 'null')


Training and Evaluating the Model 


Now, we can start training. 


lr, num_epochs = 0.01, 5
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)


loss 0.296, train acc 0.876, test acc 0.857
590.4 examples/sec on [gpu(0), gpu(1)]


[Figure: training curves showing train loss, train acc, and test acc.]
Finally, define the prediction function.


#@save
def predict_sentiment(net, vocab, sentence):
    sentence = np.array(vocab[sentence.split()], ctx=d2l.try_gpu())
    label = np.argmax(net(sentence.reshape(1, -1)), axis=1)
    return 'positive' if label == 1 else 'negative'
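The prediction function combines two small steps that can be examined without a trained network: turning a sentence into indices, and turning two output scores into a label. The sketch below uses a toy vocabulary and made-up scores; `sentence_to_indices` and `scores_to_label` are hypothetical helpers, and mapping unseen words to index 0 is an assumption of this sketch.

```python
def sentence_to_indices(sentence, token_to_idx, unk_idx=0):
    """Map whitespace-separated words to indices, using `unk_idx` if unseen."""
    return [token_to_idx.get(word, unk_idx) for word in sentence.split()]


def scores_to_label(scores):
    """Pick the class with the larger score; index 1 means 'positive'."""
    return 'positive' if scores.index(max(scores)) == 1 else 'negative'


toy_token_to_idx = {'<unk>': 0, 'this': 1, 'movie': 2, 'is': 3, 'so': 4,
                    'great': 5}
toy_indices = sentence_to_indices('this movie is so superb', toy_token_to_idx)
print(toy_indices)
print(scores_to_label([-1.2, 0.7]))
```

Because splitting is done on whitespace only, punctuation attached to a word (for example 'great!') would not match the vocabulary entry and would fall back to the unknown index.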


Then, use the trained model to classify the sentiments of two simple sentences.


predict_sentiment(net, vocab, 'this movie is so great')


'positive'


predict_sentiment(net, vocab, 'this movie is so bad')


'negative'


Summary 


• Text classification transforms a sequence of text of indefinite length into a category of text.
This is a downstream application of word embedding.


• We can apply pre-trained word vectors and recurrent neural networks to classify the
emotions in a text.


Exercises 


1. Increase the number of epochs. What accuracy rate can you achieve on the training and
testing datasets? What about trying to re-tune other hyperparameters?


2. Will using larger pre-trained word vectors, such as 300-dimensional GloVe word vectors,
improve classification accuracy?


3. Can we improve the classification accuracy by using the spaCy word tokenization tool? You
need to install spaCy: pip install spacy and install the English package: python -m spacy
download en. In the code, first import spacy: import spacy. Then, load the spaCy English
package: spacy_en = spacy.load('en'). Finally, define the function def tokenizer(text):
return [tok.text for tok in spacy_en.tokenizer(text)] and replace the original
tokenizer function. It should be noted that GloVe's word vector uses “-” to connect each word
when storing noun phrases. For example, the phrase “new york” is represented as
“new-york” in GloVe. After using spaCy tokenization, “new york” may be stored as “new york”.


Discussions


15.3 Sentiment Analysis: Using Convolutional Neural Networks 


In Chapter 6, we explored how to process two-dimensional image data with two-dimensional
convolutional neural networks. In the previous language models and text classification tasks, we
treated text data as a time series with only one dimension, and naturally, we used recurrent
neural networks to process such data. In fact, we can also treat text as a one-dimensional image, so
that we can use one-dimensional convolutional neural networks to capture associations between
adjacent words. As depicted in Fig. 15.3.1, this section describes a groundbreaking approach to
applying convolutional neural networks to sentiment analysis: textCNN (Kim, 2014).





https://discuss.d2l.ai/t/392


Fig. 15.3.1: This section feeds pretrained GloVe to a CNN-based architecture for sentiment
analysis.


First, import the packages and modules required for the experiment. 


from d2l import mxnet as d2l
from mxnet import gluon, init, np, npx
from mxnet.gluon import nn

npx.set_np()


batch_size = 64
train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)


15.3.1 One-Dimensional Convolutional Layer 


Before introducing the model, let us explain how a one-dimensional convolutional layer works. 
Like a two-dimensional convolutional layer, a one-dimensional convolutional layer uses a one- 
dimensional cross-correlation operation. In the one-dimensional cross-correlation operation, the 
convolution window starts from the leftmost side of the input array and slides on the input array 
from left to right successively. When the convolution window slides to a certain position, the 
input subarray in the window and kernel array are multiplied and summed by element to get the 
element at the corresponding location in the output array. As shown in Fig. 15.3.2, the input is a 
one-dimensional array with a width of 7 and the width of the kernel array is 2. As we can see, the 
output width is 7 — 2 + 1 = 6 and the first element is obtained by performing multiplication by 
element on the leftmost input subarray with a width of 2 and kernel array and then summing the 
results. 


[Figure: Input, Kernel, Output]


Fig. 15.3.2: One-dimensional cross-correlation operation. The shaded parts are the first output
element as well as the input and kernel array elements used in its calculation: 0 × 1 + 1 × 2 = 2.


Next, we implement one-dimensional cross-correlation in the corr1d function. It accepts the
input array X and kernel array K and outputs the array Y.


def corr1d(X, K):
    w = K.shape[0]
    Y = np.zeros((X.shape[0] - w + 1))
    for i in range(Y.shape[0]):
        Y[i] = (X[i: i + w] * K).sum()
    return Y


Now, we will reproduce the results of the one-dimensional cross-correlation operation in
Fig. 15.3.2.


X, K = np.array([0, 1, 2, 3, 4, 5, 6]), np.array([1, 2])
corr1d(X, K)


array([ 2.,  5.,  8., 11., 14., 17.])


The one-dimensional cross-correlation operation for multiple input channels is also similar to 
the two-dimensional cross-correlation operation for multiple input channels. On each channel, 
it performs the one-dimensional cross-correlation operation on the kernel and its corresponding 
input and adds the results of the channels to get the output. Fig. 15.3.3 shows a one-dimensional 
cross-correlation operation with three input channels. 


[Figure: Input, Kernel, Output]


Fig. 15.3.3: One-dimensional cross-correlation operation with three input channels. The shaded
parts are the first output element as well as the input and kernel array elements used in its
calculation: 0 × 1 + 1 × 2 + 1 × 3 + 2 × 4 + 2 × (-1) + 3 × (-3) = 2.


Now, we reproduce the results of the one-dimensional cross-correlation operation with multiple
input channels in Fig. 15.3.3.


def corr1d_multi_in(X, K):
    # First, we traverse along the 0th dimension (channel dimension) of `X`
    # and `K`. Then, we sum the per-channel cross-correlation results
    return sum(corr1d(x, k) for x, k in zip(X, K))


X = np.array([[0, 1, 2, 3, 4, 5, 6],
              [1, 2, 3, 4, 5, 6, 7],
              [2, 3, 4, 5, 6, 7, 8]])
K = np.array([[1, 2], [3, 4], [-1, -3]])
corr1d_multi_in(X, K)


array([ 2.,  8., 14., 20., 26., 32.])


The definition of a two-dimensional cross-correlation operation tells us that a one-dimensional 
cross-correlation operation with multiple input channels can be regarded as a two-dimensional 
cross-correlation operation with a single input channel. As shown in Fig. 15.3.4, we can also 
present the one-dimensional cross-correlation operation with multiple input channels in Fig. 
15.3.3 as the equivalent two-dimensional cross-correlation operation with a single input channel. 
Here, the height of the kernel is equal to the height of the input. 


[Figure: Input, Kernel, Output]


Fig. 15.3.4: Two-dimensional cross-correlation operation with a single input channel. The
highlighted parts are the first output element and the input and kernel array elements used in its
calculation: 2 × (-1) + 3 × (-3) + 1 × 3 + 2 × 4 + 0 × 1 + 1 × 2 = 2.
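The equivalence between the two views can be checked numerically. The following is a minimal sketch in plain NumPy (not the MXNet arrays used in the rest of this section); the helper corr2d is written here only for illustration and is an assumption of this example. It reuses the input and kernel of Fig. 15.3.3.

```python
import numpy as np

def corr1d(X, K):
    """One-dimensional cross-correlation of a 1-D input with a 1-D kernel."""
    w = K.shape[0]
    Y = np.zeros(X.shape[0] - w + 1)
    for i in range(Y.shape[0]):
        Y[i] = (X[i:i + w] * K).sum()
    return Y

def corr2d(X, K):
    """Two-dimensional cross-correlation with a single input channel."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = np.array([[0, 1, 2, 3, 4, 5, 6],
              [1, 2, 3, 4, 5, 6, 7],
              [2, 3, 4, 5, 6, 7, 8]], dtype=float)
K = np.array([[1, 2], [3, 4], [-1, -3]], dtype=float)

# View 1: three input channels, each cross-correlated in 1-D, then summed
multi_in_1d = sum(corr1d(x, k) for x, k in zip(X, K))
# View 2: one 2-D input whose height equals the kernel height,
# so the output has a single row
single_in_2d = corr2d(X, K)

print(multi_in_1d)
print(single_in_2d[0])
```

Both print statements show the same six values, because the 2-D kernel can only slide horizontally when its height matches the input height.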





Both the outputs in Fig. 15.3.2 and Fig. 15.3.3 have only one channel. We discussed how to specify 
multiple output channels in a two-dimensional convolutional layer in Section 6.4. Similarly, we 
can also specify multiple output channels in the one-dimensional convolutional layer to extend 
the model parameters in the convolutional layer. 
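As a rough illustration of multiple output channels, the sketch below stacks several multi-input-channel kernels along a new leading axis. It is written in plain NumPy; the function name corr1d_multi_in_out and the shifted kernels K0 + 1 and K0 + 2 are assumptions made only for this example.

```python
import numpy as np

def corr1d_multi_in_out(X, K):
    """X: (in_channels, width); K: (out_channels, in_channels, kernel_width)."""
    c_out, _, w = K.shape
    out_width = X.shape[1] - w + 1
    Y = np.zeros((c_out, out_width))
    for o in range(c_out):
        for i in range(out_width):
            # Each output channel sums over all input channels and the window
            Y[o, i] = (X[:, i:i + w] * K[o]).sum()
    return Y

X = np.array([[0, 1, 2, 3, 4, 5, 6],
              [1, 2, 3, 4, 5, 6, 7],
              [2, 3, 4, 5, 6, 7, 8]], dtype=float)
K0 = np.array([[1, 2], [3, 4], [-1, -3]], dtype=float)
# Three output channels, each with its own (in_channels, kernel_width) kernel
K = np.stack([K0, K0 + 1, K0 + 2])

Y = corr1d_multi_in_out(X, K)
print(Y.shape)   # (out_channels, output width)
print(K.size)    # number of kernel parameters
```

The number of kernel parameters grows linearly with the number of output channels, which is the sense in which more output channels extend the model parameters.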


15.3.2 Max-Over-Time Pooling Layer 


Similarly, we have a one-dimensional pooling layer. The max-over-time pooling layer used in 
TextCNN actually corresponds to a one-dimensional global maximum pooling layer. Assuming 
that the input contains multiple channels, and each channel consists of values on different time 
steps, the output of each channel will be the largest value of all time steps in the channel. There- 
fore, the input of the max-over-time pooling layer can have different time steps on each channel. 
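A small NumPy sketch may help here. It is an illustration only: the function max_over_time and the sample values are made up for this example, and the channels are stored as a list precisely because they have different numbers of time steps.

```python
import numpy as np

def max_over_time(channels):
    """Return the largest value over all time steps of each channel."""
    return np.array([np.max(c) for c in channels])

# Two channels with 3 and 5 time steps, respectively
channels = [np.array([0.1, 0.7, 0.3]),
            np.array([0.5, 0.2, 0.9, 0.4, 0.0])]
pooled = max_over_time(channels)
print(pooled)  # one value per channel, regardless of channel length

# Appending zeros to a channel of non-negative values leaves its maximum unchanged
padded_max = np.max(np.concatenate([channels[0], np.zeros(2)]))
print(padded_max)
```

The output has one entry per channel, so channels of different lengths can still be concatenated into a fixed-size vector afterwards.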


To improve computing performance, we often combine sequence examples of different lengths into
a minibatch and make the lengths of all examples in the batch consistent by appending
special characters (such as 0) to the end of shorter examples. Naturally, the added special
characters have no intrinsic meaning. Because the main purpose of the max-over-time pooling layer
is to capture the most important features in the sequence, it usually allows the model to be
unaffected by the manually added characters.


15.3.3 The TextCNN Model


TextCNN mainly uses a one-dimensional convolutional layer and max-over-time pooling layer. 
Suppose the input text sequence consists of n words, and each word is represented by a d- 
dimension word vector. Then the input example has a width of n, a height of 1, and d input 
channels. The calculation of textCNN can be mainly divided into the following steps: 


1. Define multiple one-dimensional convolution kernels and use them to perform convolution 
calculations on the inputs. Convolution kernels with different widths may capture the cor- 
relation of different numbers of adjacent words. 


2. Perform max-over-time pooling on all output channels, and then concatenate the pooling 
output values of these channels in a vector. 


3. The concatenated vector is transformed into the output for each category through the fully 
connected layer. A dropout layer can be used in this step to deal with overfitting. 


Fig. 15.3.5: TextCNN design.


Fig. 15.3.5 gives an example to illustrate the textCNN. The input here is a sentence with 11 words,
with each word represented by a 6-dimensional word vector. Therefore, the input sequence has a
width of 11 and 6 input channels. We assume there are two one-dimensional convolution kernels
with widths of 2 and 4, and 4 and 5 output channels, respectively. Therefore, after one-dimensional
convolution calculation, the width of the four output channels is 11 - 2 + 1 = 10, while the width
of the other five channels is 11 - 4 + 1 = 8. Even though the width of each channel is different, we
can still perform max-over-time pooling for each channel and concatenate the pooling outputs of
the 9 channels into a 9-dimensional vector. Finally, we use a fully connected layer to transform
the 9-dimensional vector into a 2-dimensional output: positive sentiment and negative sentiment
predictions.
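The shapes in this example can be traced with a short NumPy sketch. It is a simplification under stated assumptions: random weights stand in for learned ones, the helper conv1d is written only for this illustration, and the activation, bias, and dropout of the real model are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(X, K):
    """X: (in_channels, n); K: (out_channels, in_channels, width).

    Returns an array of shape (out_channels, n - width + 1).
    """
    c_out, _, w = K.shape
    n_out = X.shape[1] - w + 1
    return np.array([[(X[:, i:i + w] * K[o]).sum() for i in range(n_out)]
                     for o in range(c_out)])

n, d = 11, 6                       # 11 words, 6-dimensional word vectors
X = rng.normal(size=(d, n))        # channels first: (6, 11)
K2 = rng.normal(size=(4, d, 2))    # width 2, 4 output channels
K4 = rng.normal(size=(5, d, 4))    # width 4, 5 output channels

out2, out4 = conv1d(X, K2), conv1d(X, K4)
# Max-over-time pooling per channel, then concatenation into one vector
pooled = np.concatenate([out2.max(axis=1), out4.max(axis=1)])
# Fully connected layer mapping the 9-dimensional vector to 2 outputs
W = rng.normal(size=(2, pooled.shape[0]))
scores = W @ pooled

print(out2.shape, out4.shape, pooled.shape, scores.shape)
```

The printed shapes are (4, 10), (5, 8), (9,), and (2,), matching the widths and dimensions described for Fig. 15.3.5.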


Next, we will implement a textCNN model. Compared with the previous section, in addition to 
replacing the recurrent neural network with a one-dimensional convolutional layer, here we use 
two embedding layers, one with a fixed weight and another that participates in training. 


class TextCNN(nn.Block):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
                 **kwargs):
        super(TextCNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # The embedding layer does not participate in training
        self.constant_embedding = nn.Embedding(vocab_size, embed_size)
        self.dropout = nn.Dropout(0.5)
        self.decoder = nn.Dense(2)
        # The max-over-time pooling layer has no weight, so it can share an
        # instance
        self.pool = nn.GlobalMaxPool1D()
        # Create multiple one-dimensional convolutional layers
        self.convs = nn.Sequential()
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.add(nn.Conv1D(c, k, activation='relu'))

    def forward(self, inputs):
        # Concatenate the output of two embedding layers with shape of
        # (batch size, no. of words, word vector dimension) by word vector
        embeddings = np.concatenate((
            self.embedding(inputs), self.constant_embedding(inputs)), axis=2)
        # According to the input format required by Conv1D, the word vector
        # dimension, that is, the channel dimension of the one-dimensional
        # convolutional layer, is transformed into the previous dimension
        embeddings = embeddings.transpose(0, 2, 1)
        # For each one-dimensional convolutional layer, after max-over-time
        # pooling, an ndarray with the shape of (batch size, channel size, 1)
        # can be obtained. Use `np.squeeze` to remove the last
        # dimension and then concatenate on the channel dimension
        encoding = np.concatenate([
            np.squeeze(self.pool(conv(embeddings)), axis=-1)
            for conv in self.convs], axis=1)
        # After applying the dropout method, use a fully connected layer to
        # obtain the output
        outputs = self.decoder(self.dropout(encoding))
        return outputs


Create a TextCNN instance. It has 3 convolutional layers with kernel widths of 3, 4, and 5, all with 
100 output channels. 


embed_size, kernel_sizes, nums_channels = 100, [3, 4, 5], [100, 100, 100]
devices = d2l.try_all_gpus()
net = TextCNN(len(vocab), embed_size, kernel_sizes, nums_channels)
net.initialize(init.Xavier(), ctx=devices)


Load Pre-trained Word Vectors


As in the previous section, load pre-trained 100-dimensional GloVe word vectors and initialize the 
embedding layers embedding and constant_embedding. Here, the former participates in training 
while the latter has a fixed weight. 


glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
embeds = glove_embedding[vocab.idx_to_token]
net.embedding.weight.set_data(embeds)
net.constant_embedding.weight.set_data(embeds)
net.constant_embedding.collect_params().setattr('grad_req', 'null')


Train and Evaluate the Model 


Now we can train the model.


lr, num_epochs = 0.001, 5
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)


loss 0.090, train acc 0.968, test acc 0.862
3772.3 examples/sec on [gpu(0), gpu(1)]


[Figure: train loss, train acc, and test acc plotted against epoch.]


Below, we use the trained model to classify sentiments of two simple sentences. 


d2l.predict_sentiment(net, vocab, 'this movie is so great')


'positive'


d2l.predict_sentiment(net, vocab, 'this movie is so bad')


'negative'


Summary 


• We can use one-dimensional convolution to process and analyze sequence data.


• A one-dimensional cross-correlation operation with multiple input channels can be
regarded as a two-dimensional cross-correlation operation with a single input channel.


• The input of the max-over-time pooling layer can have different numbers of time steps on
each channel.


• TextCNN mainly uses a one-dimensional convolutional layer and max-over-time pooling
layer.


Exercises 


1. Tune the hyperparameters and compare the two sentiment analysis methods, using recur- 
rent neural networks and using convolutional neural networks, as regards accuracy and op- 
erational efficiency. 


2. Can you further improve the accuracy of the model on the test set by using the three methods 
introduced in the previous section: tuning hyperparameters, using larger pre-trained word 
vectors, and using the spaCy word tokenization tool? 


3. What other natural language processing tasks can you use textCNN for?


4. Add positional encoding in the input representations. Does it improve the performance?


Discussions


15.4 Natural Language Inference and the Dataset 


In Section 15.1, we discussed the problem of sentiment analysis. This task aims to classify a
single text sequence into predefined categories, such as a set of sentiment polarities. However,
when there is a need to decide whether one sentence can be inferred from another, or to eliminate
redundancy by identifying sentences that are semantically equivalent, knowing how to classify one
text sequence is insufficient. Instead, we need to be able to reason over pairs of text sequences.


15.4.1 Natural Language Inference


Natural language inference studies whether a hypothesis can be inferred from a premise, where both
are text sequences. In other words, natural language inference determines the logical
relationship between a pair of text sequences. Such relationships usually fall into three types:


• Entailment: the hypothesis can be inferred from the premise.
• Contradiction: the negation of the hypothesis can be inferred from the premise.
• Neutral: all the other cases.


Natural language inference is also known as the recognizing textual entailment task. For example,
the following pair will be labeled as entailment because “showing affection” in the hypothesis can
be inferred from “hugging each other” in the premise.


Premise: Two women are hugging each other. 
Hypothesis: Two women are showing affection. 


The following is an example of contradiction as “running the coding example” indicates “not sleep- 
ing” rather than “sleeping”. 


Premise: A man is running the coding example from Dive into Deep Learning. 
Hypothesis: The man is sleeping. 


The third example shows a neutrality relationship because neither “famous” nor “not famous” can 
be inferred from the fact that “are performing for us”. 


Premise: The musicians are performing for us. 
Hypothesis: The musicians are famous. 


Natural language inference has been a central topic for understanding natural language. It en- 
joys wide applications ranging from information retrieval to open-domain question answering. 
To study this problem, we will begin by investigating a popular natural language inference bench- 
mark dataset. 


15.4.2 The Stanford Natural Language Inference (SNLI) Dataset 


The Stanford Natural Language Inference (SNLI) Corpus is a collection of over 500,000 labeled
English sentence pairs (Bowman et al., 2015). We download and store the extracted SNLI dataset in
the path ../data/snli_1.0.


from d2l import mxnet as d2l
from mxnet import gluon, np, npx
import os
import re

npx.set_np()

#@save
d2l.DATA_HUB['SNLI'] = (
    'https://nlp.stanford.edu/projects/snli/snli_1.0.zip',
    '9fcde07509c7e87ec61c640c1b2753d9041758e4')

data_dir = d2l.download_extract('SNLI')


Downloading ../data/snli_1.0.zip from https://nlp.stanford.edu/projects/snli/snli_1.0.zip...


Reading the Dataset 


The original SNLI dataset contains much richer information than what we really need in our ex- 
periments. Thus, we define a function read_snli to only extract part of the dataset, then return 
lists of premises, hypotheses, and their labels. 


#@save
def read_snli(data_dir, is_train):
    """Read the SNLI dataset into premises, hypotheses, and labels."""
    def extract_text(s):
        # Remove information that will not be used by us
        s = re.sub('\\(', '', s)
        s = re.sub('\\)', '', s)
        # Substitute two or more consecutive whitespace with space
        s = re.sub('\\s{2,}', ' ', s)
        return s.strip()
    label_set = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
    file_name = os.path.join(data_dir, 'snli_1.0_train.txt'
                             if is_train else 'snli_1.0_test.txt')
    with open(file_name, 'r') as f:
        rows = [row.split('\t') for row in f.readlines()[1:]]
    premises = [extract_text(row[1]) for row in rows if row[0] in label_set]
    hypotheses = [extract_text(row[2]) for row in rows if row[0] in label_set]
    labels = [label_set[row[0]] for row in rows if row[0] in label_set]
    return premises, hypotheses, labels
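To see what the text-cleaning step does, the following standalone sketch applies the same three substitutions to a parenthesized string. The input string is a made-up illustration, not a row taken from the dataset.

```python
import re

def extract_text(s):
    """Drop parentheses, collapse runs of whitespace, and trim the ends."""
    s = re.sub(r'\(', '', s)
    s = re.sub(r'\)', '', s)
    s = re.sub(r'\s{2,}', ' ', s)
    return s.strip()

# Hypothetical bracketed input, used only to exercise the function
example = '( ( A person ) ( is ( outdoors ) ) )'
cleaned = extract_text(example)
print(cleaned)  # A person is outdoors
```

Removing the parentheses leaves runs of two or more spaces behind, which is why the whitespace substitution and the final strip are needed.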


Now let us print the first 3 pairs of premise and hypothesis, as well as their labels (“0”, “1”, and
“2” correspond to “entailment”, “contradiction”, and “neutral”, respectively).


train_data = read_snli(data_dir, is_train=True)
for x0, x1, y in zip(train_data[0][:3], train_data[1][:3], train_data[2][:3]):
    print('premise:', x0)
    print('hypothesis:', x1)
    print('label:', y)


premise: A person on a horse jumps over a broken down airplane . 
hypothesis: A person is training his horse for a competition . 
label: 2 

premise: A person on a horse jumps over a broken down airplane . 
hypothesis: A person is at a diner , ordering an omelette . 
label: 1 

premise: A person on a horse jumps over a broken down airplane . 
hypothesis: A person is outdoors , on a horse . 

label: 0 


The training set has about 550,000 pairs, and the testing set has about 10,000 pairs. The following
shows that the three labels “entailment”, “contradiction”, and “neutral” are balanced in both the
training set and the testing set.


test_data = read_snli(data_dir, is_train=False)
for data in [train_data, test_data]:
    print([[row for row in data[2]].count(i) for i in range(3)])


[183416, 183187, 182764]
[3368, 3237, 3219]
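As a quick check of the balance claim, the sketch below turns label counts into proportions. The training counts are the ones printed above; the test counts 3368, 3237, and 3219 are taken as the three values that sum to the 9824 test examples, and the helper proportions is an assumption of this example.

```python
train_counts = [183416, 183187, 182764]
test_counts = [3368, 3237, 3219]

def proportions(counts):
    """Fraction of examples carrying each label."""
    total = sum(counts)
    return [c / total for c in counts]

for name, counts in [('train', train_counts), ('test', test_counts)]:
    print(name, sum(counts), [f'{p:.3f}' for p in proportions(counts)])
```

Each label accounts for roughly one third of the examples in both splits.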


Defining a Class for Loading the Dataset 


Below we define a class for loading the SNLI dataset by inheriting from the Dataset class in Gluon.
The argument num_steps in the class constructor specifies the length of a text sequence so that
each minibatch of sequences will have the same shape. In other words, tokens after the first
num_steps ones in longer sequences are trimmed, while special tokens “<pad>” will be appended
to shorter sequences until their length becomes num_steps. By implementing the __getitem__
function, we can arbitrarily access the premise, hypothesis, and label with the index idx.
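The trimming and padding rule can be sketched in a few lines of plain Python. This is a minimal illustration of the rule described above; the function below is written for this example and may differ in detail from the library helper used in the class.

```python
def truncate_pad(line, num_steps, padding_token):
    """Trim `line` to `num_steps` items, or pad it on the right."""
    if len(line) > num_steps:
        return line[:num_steps]  # keep only the first `num_steps` tokens
    return line + [padding_token] * (num_steps - len(line))

# Token indices are made up; 0 plays the role of the "<pad>" index here
long_line = truncate_pad([5, 8, 2, 9, 4], num_steps=3, padding_token=0)
short_line = truncate_pad([5, 8], num_steps=4, padding_token=0)
print(long_line)   # [5, 8, 2]
print(short_line)  # [5, 8, 0, 0]
```

After this step every sequence has exactly num_steps entries, so a list of sequences can be stacked into a single array.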


#@save
class SNLIDataset(gluon.data.Dataset):
    """A customized dataset to load the SNLI dataset."""
    def __init__(self, dataset, num_steps, vocab=None):
        self.num_steps = num_steps
        all_premise_tokens = d2l.tokenize(dataset[0])
        all_hypothesis_tokens = d2l.tokenize(dataset[1])
        if vocab is None:
            self.vocab = d2l.Vocab(all_premise_tokens + all_hypothesis_tokens,
                                   min_freq=5, reserved_tokens=['<pad>'])
        else:
            self.vocab = vocab
        self.premises = self._pad(all_premise_tokens)
        self.hypotheses = self._pad(all_hypothesis_tokens)
        self.labels = np.array(dataset[2])
        print('read ' + str(len(self.premises)) + ' examples')

    def _pad(self, lines):
        return np.array([
            d2l.truncate_pad(self.vocab[line], self.num_steps,
                             self.vocab['<pad>']) for line in lines
        ])

    def __getitem__(self, idx):
        return (self.premises[idx], self.hypotheses[idx]), self.labels[idx]

    def __len__(self):
        return len(self.premises)


Putting All Things Together


Now we can invoke the read_snli function and the SNLIDataset class to download the SNLI dataset
and return DataLoader instances for both training and testing sets, together with the vocabulary
of the training set. Note that we must use the vocabulary constructed from the training
set for the testing set as well. As a result, any new token from the testing set will be unknown to
the model trained on the training set.
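The effect of reusing the training vocabulary can be illustrated with a toy example in plain Python. The helpers build_vocab and encode, the token lists, and the "<unk>" entry are assumptions made for this sketch and are not the vocabulary class used in this section.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, reserved=('<unk>', '<pad>')):
    """Map each sufficiently frequent training token to an integer index."""
    counts = Counter(tok for line in token_lists for tok in line)
    idx_to_token = list(reserved) + sorted(
        tok for tok, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(idx_to_token)}

def encode(tokens, vocab):
    """Tokens never seen in training fall back to the '<unk>' index."""
    return [vocab.get(tok, vocab['<unk>']) for tok in tokens]

train_tokens = [['a', 'man', 'sleeps'], ['a', 'dog', 'runs']]
vocab = build_vocab(train_tokens)            # built from training data only
test_ids = encode(['a', 'cat', 'sleeps'], vocab)
print(test_ids)  # [2, 0, 6]
```

The token "cat" appears only in the test sentence, so it is mapped to index 0, the unknown token, while the other two tokens keep their training indices.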


#@save
def load_data_snli(batch_size, num_steps=50):
    """Download the SNLI dataset and return data iterators and vocabulary."""
    num_workers = d2l.get_dataloader_workers()
    data_dir = d2l.download_extract('SNLI')
    train_data = read_snli(data_dir, True)
    test_data = read_snli(data_dir, False)
    train_set = SNLIDataset(train_data, num_steps)
    test_set = SNLIDataset(test_data, num_steps, train_set.vocab)
    train_iter = gluon.data.DataLoader(train_set, batch_size, shuffle=True,
                                       num_workers=num_workers)
    test_iter = gluon.data.DataLoader(test_set, batch_size, shuffle=False,
                                      num_workers=num_workers)
    return train_iter, test_iter, train_set.vocab


Here we set the batch size to 128 and sequence length to 50, and invoke the load_data_snli func- 
tion to get the data iterators and vocabulary. Then we print the vocabulary size. 


train_iter, test_iter, vocab = load_data_snli(128, 50) 
len(vocab) 


read 549367 examples 
read 9824 examples 


18678 


Now we print the shape of the first minibatch. Contrary to sentiment analysis, we have 2 inputs 
X[0] and X[1] representing pairs of premises and hypotheses. 


for X, Y in train_iter: 
print(X[0].shape) 
print(X[1].shape) 
print(Y.shape) 
break 


(128, 50) 
(128, 50) 
(128,)


Summary


• Natural language inference studies whether a hypothesis can be inferred from a premise,
where both are text sequences.


• In natural language inference, relationships between premises and hypotheses include
entailment, contradiction, and neutral.


• The Stanford Natural Language Inference (SNLI) Corpus is a popular benchmark dataset of
natural language inference.


Exercises 


1. Machine translation has long been evaluated based on superficial n-gram matching between 
an output translation and a ground-truth translation. Can you design a measure for evaluat- 
ing machine translation results by using natural language inference? 


2. How can we change hyperparameters to reduce the vocabulary size? 


Discussions


15.5 Natural Language Inference: Using Attention 


We introduced the natural language inference task and the SNLI dataset in Section 15.4. In view of 
many models that are based on complex and deep architectures, Parikh et al. proposed to address 
natural language inference with attention mechanisms and called it a “decomposable attention 
model” (Parikh et al., 2016). This results in a model without recurrent or convolutional layers, 
achieving the best result at the time on the SNLI dataset with much fewer parameters. In this 
section, we will describe and implement this attention-based method (with MLPs) for natural lan- 
guage inference, as depicted in Fig. 15.5.1. 





Fig. 15.5.1: This section feeds pretrained GloVe to an architecture based on attention and MLPs
for natural language inference.


15.5.1 The Model


Simpler than preserving the order of words in premises and hypotheses, we can just align words 
in one text sequence to every word in the other, and vice versa, then compare and aggregate such 
information to predict the logical relationships between premises and hypotheses. Similar to 
alignment of words between source and target sentences in machine translation, the alignment of 
words between premises and hypotheses can be neatly accomplished by attention mechanisms. 







[Figure: Premise and Hypothesis; each word is compared with its aligned words; aggregation by sum,
then concat]


Fig. 15.5.2: Natural language inference using attention mechanisms.


Fig. 15.5.2 depicts the natural language inference method using attention mechanisms. At a high 
level, it consists of three jointly trained steps: attending, comparing, and aggregating. We will 
illustrate them step by step in the following. 


from d2l import mxnet as d2l
from mxnet import gluon, init, np, npx
from mxnet.gluon import nn

npx.set_np()


Attending 


The first step is to align words in one text sequence to each word in the other sequence. Suppose
that the premise is “i do need sleep” and the hypothesis is “i am tired”. Due to semantic
similarity, we may wish to align “i” in the hypothesis with “i” in the premise, and align “tired” in
the hypothesis with “sleep” in the premise. Likewise, we may wish to align “i” in the premise with
“i” in the hypothesis, and align “need” and “sleep” in the premise with “tired” in the hypothesis.
Note that such alignment is soft, using a weighted average, where ideally large weights are
associated with the words to be aligned. For ease of demonstration, Fig. 15.5.2 shows such
alignment in a hard way.


Now we describe the soft alignment using attention mechanisms in more detail. Denote by
$\mathbf{A} = (\mathbf{a}_1, \ldots, \mathbf{a}_m)$ and
$\mathbf{B} = (\mathbf{b}_1, \ldots, \mathbf{b}_n)$ the premise and hypothesis, whose numbers of
words are $m$ and $n$, respectively, where $\mathbf{a}_i, \mathbf{b}_j \in \mathbb{R}^{d}$
($i = 1, \ldots, m$, $j = 1, \ldots, n$) is a $d$-dimensional word embedding vector. For soft
alignment, we compute the attention weights $e_{ij} \in \mathbb{R}$ as


$$e_{ij} = f(\mathbf{a}_i)^\top f(\mathbf{b}_j), \tag{15.5.1}$$


where the function f is an MLP defined in the following mlp function. The output dimension of f 
is specified by the num_hiddens argument of mlp. 


def mlp(num_hiddens, flatten):
    net = nn.Sequential()
    net.add(nn.Dropout(0.2))
    net.add(nn.Dense(num_hiddens, activation='relu', flatten=flatten))
    net.add(nn.Dropout(0.2))
    net.add(nn.Dense(num_hiddens, activation='relu', flatten=flatten))
    return net


It should be highlighted that, in (15.5.1), $f$ takes inputs $\mathbf{a}_i$ and $\mathbf{b}_j$
separately rather than taking a pair of them together as the input. This decomposition trick leads
to only $m + n$ applications (linear complexity) of $f$ rather than $mn$ applications (quadratic
complexity).
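The saving can be made concrete with a small NumPy sketch. Everything here is an assumption for illustration: a single ReLU layer with random weights W stands in for the MLP, and a one-element list counts how often it is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d, h = 4, 3, 5, 8          # premise length, hypothesis length, dims
A = rng.normal(size=(m, d))      # premise word vectors a_1, ..., a_m
B = rng.normal(size=(n, d))      # hypothesis word vectors b_1, ..., b_n
W = rng.normal(size=(d, h))      # weights of the stand-in for f

call_count = [0]

def f(x):
    """Stand-in for the MLP f: one ReLU layer applied to one word vector."""
    call_count[0] += 1
    return np.maximum(x @ W, 0)

# f is applied once per word: m + n applications in total
f_A = np.stack([f(a) for a in A])
f_B = np.stack([f(b) for b in B])
# All m * n attention weights then come from one matrix product
e = f_A @ f_B.T

print(call_count[0], e.shape)
```

Here the stand-in is applied 7 times (4 + 3) to produce all 12 attention weights, whereas a function of the pair would have to be applied 12 times.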


Normalizing the attention weights in (15.5.1), we compute the weighted average of all the word 
embeddings in the hypothesis to obtain representation of the hypothesis that is softly aligned with 
the word indexed by i in the premise: 


$$\boldsymbol{\beta}_i = \sum_{j=1}^{n} \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \mathbf{b}_j. \tag{15.5.2}$$


Likewise, we compute soft alignment of premise words for each word indexed by j in the hypoth- 
esis: 


$$\boldsymbol{\alpha}_j = \sum_{i=1}^{m} \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{kj})} \mathbf{a}_i. \tag{15.5.3}$$


Below we define the Attend class to compute the soft alignment of hypotheses (beta) with input 
premises A and soft alignment of premises (alpha) with input hypotheses B. 


class Attend(nn.Block):
    def __init__(self, num_hiddens, **kwargs):
        super(Attend, self).__init__(**kwargs)
        self.f = mlp(num_hiddens=num_hiddens, flatten=False)

    def forward(self, A, B):
        # Shape of `A`/`B`: (`batch_size`, no. of words in sequence A/B,
        # `embed_size`)
        # Shape of `f_A`/`f_B`: (`batch_size`, no. of words in sequence A/B,
        # `num_hiddens`)
        f_A = self.f(A)
        f_B = self.f(B)
        # Shape of `e`: (`batch_size`, no. of words in sequence A,
        # no. of words in sequence B)
        e = npx.batch_dot(f_A, f_B, transpose_b=True)
        # Shape of `beta`: (`batch_size`, no. of words in sequence A,
        # `embed_size`), where sequence B is softly aligned with each word
        # (axis 1 of `beta`) in sequence A
        beta = npx.batch_dot(npx.softmax(e), B)
        # Shape of `alpha`: (`batch_size`, no. of words in sequence B,
        # `embed_size`), where sequence A is softly aligned with each word
        # (axis 1 of `alpha`) in sequence B
        alpha = npx.batch_dot(npx.softmax(e.transpose(0, 2, 1)), A)
        return beta, alpha
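For a single example without the batch dimension, the two soft alignments reduce to a softmax followed by a matrix product. The sketch below uses plain NumPy with random stand-in scores; the softmax helper and all values are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, n, d = 4, 3, 5
A = rng.normal(size=(m, d))     # premise word vectors
B = rng.normal(size=(n, d))     # hypothesis word vectors
e = rng.normal(size=(m, n))     # stand-in attention scores e_ij

# Row i of `beta`: weighted average of hypothesis words for premise word i
beta = softmax(e, axis=1) @ B       # shape (m, d)
# Row j of `alpha`: weighted average of premise words for hypothesis word j
alpha = softmax(e.T, axis=1) @ A    # shape (n, d)

print(beta.shape, alpha.shape)
```

Normalizing over the hypothesis axis gives the weights for beta, while transposing the scores first normalizes over the premise axis for alpha.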


Comparing 


In the next step, we compare a word in one sequence with the other sequence that is softly aligned
with that word. Note that in soft alignment, all the words from one sequence, though likely with
different attention weights, will be compared with a word in the other sequence. For ease of
demonstration, Fig. 15.5.2 pairs words with aligned words in a hard way. For example, suppose
that the attending step determines that “need” and “sleep” in the premise are both aligned with
“tired” in the hypothesis; then the pair “tired-need sleep” will be compared.


In the comparing step, we feed the concatenation (operator $[\cdot, \cdot]$) of words from one
sequence and aligned words from the other sequence into a function $g$ (an MLP):


$$\begin{aligned}
\mathbf{v}_{A,i} &= g([\mathbf{a}_i, \boldsymbol{\beta}_i]), \quad i = 1, \ldots, m, \\
\mathbf{v}_{B,j} &= g([\mathbf{b}_j, \boldsymbol{\alpha}_j]), \quad j = 1, \ldots, n.
\end{aligned} \tag{15.5.4}$$


In (15.5.4), $\mathbf{v}_{A,i}$ is the comparison between word $i$ in the premise and all the
hypothesis words that are softly aligned with word $i$, while $\mathbf{v}_{B,j}$ is the comparison
between word $j$ in the hypothesis and all the premise words that are softly aligned with word $j$.
The following Compare class defines such a comparing step.


class Compare(nn.Block):
    def __init__(self, num_hiddens, **kwargs):
        super(Compare, self).__init__(**kwargs)
        self.g = mlp(num_hiddens=num_hiddens, flatten=False)

    def forward(self, A, B, beta, alpha):
        V_A = self.g(np.concatenate([A, beta], axis=2))
        V_B = self.g(np.concatenate([B, alpha], axis=2))
        return V_A, V_B


Aggregating 


With two sets of comparison vectors $\mathbf{v}_{A,i}$ ($i = 1, \ldots, m$) and
$\mathbf{v}_{B,j}$ ($j = 1, \ldots, n$) on hand, in the last step we will aggregate such
information to infer the logical relationship. We begin by summing up both sets:


$$\mathbf{v}_A = \sum_{i=1}^{m} \mathbf{v}_{A,i}, \quad \mathbf{v}_B = \sum_{j=1}^{n} \mathbf{v}_{B,j}. \tag{15.5.5}$$


Next we feed the concatenation of both summarization results into function $h$ (an MLP) to obtain
the classification result of the logical relationship:


$$\hat{\mathbf{y}} = h([\mathbf{v}_A, \mathbf{v}_B]). \tag{15.5.6}$$


The aggregation step is defined in the following Aggregate class.


class Aggregate(nn.Block):
    def __init__(self, num_hiddens, num_outputs, **kwargs):
        super(Aggregate, self).__init__(**kwargs)
        self.h = mlp(num_hiddens=num_hiddens, flatten=True)
        self.h.add(nn.Dense(num_outputs))

    def forward(self, V_A, V_B):
        # Sum up both sets of comparison vectors
        V_A = V_A.sum(axis=1)
        V_B = V_B.sum(axis=1)
        # Feed the concatenation of both summarization results into an MLP
        Y_hat = self.h(np.concatenate([V_A, V_B], axis=1))
        return Y_hat
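To make the shapes in the comparing and aggregating steps concrete, here is a minimal pure-Python sketch. It is not the MXNet implementation: `toy_g` is a hypothetical stand-in for the MLP g, and all vectors are made-up values chosen for illustration.

```python
def toy_g(x):
    # Hypothetical stand-in for the MLP g: an elementwise ReLU
    return [max(0.0, v) for v in x]

def compare(seq, aligned):
    # v_i = g([x_i, aligned_i]): concatenate each word vector with the
    # vector of the words softly aligned with it, then apply g
    return [toy_g(x + a) for x, a in zip(seq, aligned)]

def aggregate(V_A, V_B):
    # Sum each set of comparison vectors over its sequence axis,
    # then concatenate the two sums
    v_A = [sum(col) for col in zip(*V_A)]
    v_B = [sum(col) for col in zip(*V_B)]
    return v_A + v_B

A = [[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]]     # premise, m = 3 words
B = [[2.0, 0.0], [-1.0, 1.0]]                 # hypothesis, n = 2 words
beta = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.0]]   # one aligned vector per premise word
alpha = [[1.0, 1.0], [0.0, -2.0]]             # one aligned vector per hypothesis word

V_A = compare(A, beta)    # 3 comparison vectors of length 4
V_B = compare(B, alpha)   # 2 comparison vectors of length 4
v = aggregate(V_A, V_B)   # a single vector of length 8
```

Because of the sums, the length of `v` depends only on the output size of g, not on the sequence lengths m and n, which is why a fixed-size MLP h can follow.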


Putting All Things Together 


By putting the attending, comparing, and aggregating steps together, we define the decomposable 
attention model to jointly train these three steps. 


class DecomposableAttention(nn.Block):
    def __init__(self, vocab, embed_size, num_hiddens, **kwargs):
        super(DecomposableAttention, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_size)
        self.attend = Attend(num_hiddens)
        self.compare = Compare(num_hiddens)
        # There are 3 possible outputs: entailment, contradiction, and neutral
        self.aggregate = Aggregate(num_hiddens, 3)

    def forward(self, X):
        premises, hypotheses = X
        A = self.embedding(premises)
        B = self.embedding(hypotheses)
        beta, alpha = self.attend(A, B)
        V_A, V_B = self.compare(A, B, beta, alpha)
        Y_hat = self.aggregate(V_A, V_B)
        return Y_hat


15.5.2 Training and Evaluating the Model 


Now we will train and evaluate the defined decomposable attention model on the SNLI dataset. 
We begin by reading the dataset. 


Reading the dataset 


We download and read the SNLI dataset using the function defined in Section 15.4. The batch size 
and sequence length are set to 256 and 50, respectively. 


batch_size, num_steps = 256, 50
train_iter, test_iter, vocab = d2l.load_data_snli(batch_size, num_steps)


read 549367 examples 
read 9824 examples 


Creating the Model 


We use the pretrained 100-dimensional GloVe embedding to represent the input tokens. Thus, we 
predefine the dimension of vectors $\mathbf{a}_i$ and $\mathbf{b}_j$ in (15.5.1) as 100. The output dimension of 
functions $f$ in (15.5.1) and $g$ in (15.5.4) is set to 200. Then we create a model instance, initialize its 
parameters, and load the GloVe embedding to initialize vectors of input tokens. 


embed_size, num_hiddens, devices = 100, 200, d2l.try_all_gpus()
net = DecomposableAttention(vocab, embed_size, num_hiddens)
net.initialize(init.Xavier(), ctx=devices)
glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
embeds = glove_embedding[vocab.idx_to_token]
net.embedding.weight.set_data(embeds)


Training and Evaluating the Model 


In contrast to the split_batch function in Section 12.5 that takes single inputs such as text se- 
quences (or images), we define a split_batch_multi_inputs function to take multiple inputs such 
as premises and hypotheses in minibatches. 


#@save
def split_batch_multi_inputs(X, y, devices):
    """Split multi-input `X` and `y` into multiple devices."""
    X = list(zip(*[gluon.utils.split_and_load(
        feature, devices, even_split=False) for feature in X]))
    return (X, gluon.utils.split_and_load(y, devices, even_split=False))
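The role of `zip(*...)` in this function can be seen in a small pure-Python sketch. Here `split_evenly` is a hypothetical stand-in for `gluon.utils.split_and_load` that only slices Python lists (it does not move data to any device), and the toy batch is made up.

```python
def split_evenly(items, num_devices):
    # Hypothetical stand-in for split_and_load: one contiguous shard per device
    size = -(-len(items) // num_devices)  # ceiling division
    return [items[i * size:(i + 1) * size] for i in range(num_devices)]

def split_multi_inputs(X, y, num_devices):
    # X is a tuple of features, e.g., (premises, hypotheses).
    # Splitting each feature gives a list of shards per feature ...
    shards_per_feature = [split_evenly(feature, num_devices) for feature in X]
    # ... and zip(*...) regroups them into one tuple of shards per device
    X_per_device = list(zip(*shards_per_feature))
    return X_per_device, split_evenly(y, num_devices)

premises = ['p0', 'p1', 'p2', 'p3']
hypotheses = ['h0', 'h1', 'h2', 'h3']
labels = [0, 1, 2, 0]
X_dev, y_dev = split_multi_inputs((premises, hypotheses), labels, 2)
# X_dev[0] holds the premises and hypotheses assigned to the first device
```

Each element of `X_dev` can then be passed to the network as one (premises, hypotheses) input, matching the `premises, hypotheses = X` unpacking in the model's forward function.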


Now we can train and evaluate the model on the SNLI dataset. 


lr, num_epochs = 0.001, 4
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices,
               split_batch_multi_inputs)


loss 0.513, train acc 0.797, test acc 0.816 
9266.3 examples/sec on [gpu(0), gpu(1)] 


[Figure: train loss, train acc, and test acc plotted against epoch.] 


Using the Model 


Finally, define the prediction function to output the logical relationship between a pair of premise 
and hypothesis. 


#@save
def predict_snli(net, vocab, premise, hypothesis):
    premise = np.array(vocab[premise], ctx=d2l.try_gpu())
    hypothesis = np.array(vocab[hypothesis], ctx=d2l.try_gpu())
    label = np.argmax(net([premise.reshape((1, -1)),
                           hypothesis.reshape((1, -1))]), axis=1)
    return 'entailment' if label == 0 else 'contradiction' if label == 1 \
        else 'neutral'


We can use the trained model to obtain the natural language inference result for a sample pair of 
sentences. 


predict_snli(net, vocab, ['he', 'is', 'good', '.'], ['he', 'is', 'bad', '.'])


'contradiction'


Summary 


+ The decomposable attention model consists of three steps for predicting the logical 
relationships between premises and hypotheses: attending, comparing, and aggregating. 


+ With attention mechanisms, we can align words in one text sequence to every word in the 
other, and vice versa. Such alignment is soft, using a weighted average in which large 
weights are ideally associated with the words to be aligned. 


+ The decomposition trick leads to a more desirable linear complexity than quadratic 
complexity when computing attention weights. 


+ We can use pretrained word embeddings as the input representation for downstream natural 
language processing tasks such as natural language inference. 


Exercises 


1. Train the model with other combinations of hyperparameters. Can you get better accuracy 
on the test set? 


2. What are major drawbacks of the decomposable attention model for natural language infer- 
ence? 


3. Suppose that we want to get the level of semantic similarity (e.g., a continuous value 
between 0 and 1) for any pair of sentences. How shall we collect and label the dataset? Can you 
design a model with attention mechanisms? 




15.6 Fine-Tuning BERT for Sequence-Level and Token-Level Applications 


In the previous sections of this chapter, we have designed different models for natural language 
processing applications, such as those based on RNNs, CNNs, attention, and MLPs. These models are 
helpful when there are space or time constraints. However, crafting a specific model for every 
natural language processing task is practically infeasible. In Section 14.8, we introduced a pretraining 
model, BERT, that requires minimal architecture changes for a wide range of natural language 
processing tasks. On the one hand, at the time of its proposal, BERT improved the state of the art 
on various natural language processing tasks. On the other hand, as noted in Section 14.10, the 
two versions of the original BERT model come with 110 million and 340 million parameters. Thus, 
when there are sufficient computational resources, we may consider fine-tuning BERT for 
downstream natural language processing applications. 


In the following, we generalize a subset of natural language processing applications as 
sequence-level and token-level. On the sequence level, we introduce how to transform the BERT 
representation of the text input to the output label in single text classification and text pair classification 
or regression. On the token level, we will briefly introduce new applications such as text tagging 
and question answering and shed light on how BERT can represent their inputs and how these 
representations get transformed into output labels. During fine-tuning, the “minimal architecture 
changes” required by BERT across different applications are the extra fully-connected layers. During 
supervised learning of a downstream application, parameters of the extra layers are learned from 
scratch while all the parameters in the pretrained BERT model are fine-tuned. 
15.6.1 Single Text Classification 


Single text classification takes a single text sequence as the input and outputs its classification re- 
sult. Besides sentiment analysis that we have studied in this chapter, the Corpus of Linguistic 
Acceptability (CoLA) is also a dataset for single text classification, judging whether a given sen- 
tence is grammatically acceptable or not (Warstadt et al., 2019). For instance, “I should study.” is 
acceptable but “I should studying.” is not. 


[Figure: the single text sequence “<cls> Token 1 ... Token 6 <sep>” is fed into BERT, which outputs 
the representations Rep <cls>, Rep 1, ..., Rep 6, Rep <sep>; the label is shown at the top.] 


Fig. 15.6.1: Fine-tuning BERT for single text classification applications, such as sentiment analysis 
and testing linguistic acceptability. Suppose that the input single text has six tokens. 


Section 14.8 describes the input representation of BERT. The BERT input sequence unambiguously 
represents both single text and text pairs, where the special classification token “<cls>” is used for 
sequence classification and the special separation token “<sep>” marks the end of single text 
or separates a pair of text. As shown in Fig. 15.6.1, in single text classification applications, the 
BERT representation of the special classification token “<cls>” encodes the information of the 
entire input text sequence. As the representation of the input single text, it will be fed into a small 
MLP consisting of fully-connected (dense) layers to output the distribution of all the discrete label 
values. 
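The classification head described above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the "BERT outputs" in `encoded`, the weights `W`, and the bias `b` are made-up numbers, and a single dense layer stands in for the small MLP.

```python
import math

def dense(x, W, b):
    # Fully-connected layer: one output per row of W
    return [sum(w_k * x_k for w_k, x_k in zip(w, x)) + b_j
            for w, b_j in zip(W, b)]

def softmax(z):
    # Subtract the maximum for numerical stability
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

# Made-up representations for "<cls> Token1 Token2 <sep>", hidden size 3
encoded = [[0.2, -0.1, 0.4],    # <cls>
           [0.9, 0.3, -0.5],    # Token1
           [-0.3, 0.8, 0.1],    # Token2
           [0.0, 0.0, 0.2]]     # <sep>
W = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]   # hypothetical weights for 2 labels
b = [0.0, 0.0]

cls_rep = encoded[0]                    # only the <cls> representation is used
probs = softmax(dense(cls_rep, W, b))   # distribution over the 2 labels
```

Only the first row of `encoded` reaches the head; the other token representations influence the prediction solely through BERT's self-attention.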


15.6.2 Text Pair Classification or Regression 


We have also examined natural language inference in this chapter. It belongs to text pair classifi- 
cation, a type of application classifying a pair of text. 


Taking a pair of text as the input but outputting a continuous value, semantic textual similarity 
is a popular text pair regression task. This task measures semantic similarity of sentences. For 
instance, in the Semantic Textual Similarity Benchmark dataset, the similarity score of a pair of 
sentences is an ordinal scale ranging from 0 (no meaning overlap) to 5 (meaning equivalence) (Cer 
et al., 2017). The goal is to predict these scores. Examples from the Semantic Textual Similarity 
Benchmark dataset include (sentence 1, sentence 2, similarity score): 


+ “A plane is taking off.”, “An air plane is taking off.”, 5.000; 


+ “A woman is eating something.”, “A woman is eating meat.”, 3.000; 
+ “A woman is dancing.”, “A man is talking.”, 0.000. 


[Figure: the text pair “<cls> Token 1 Token 2 <sep> Token 3 Token 4 Token 5 <sep>” (text sequence 1 
followed by text sequence 2) is fed into BERT, which outputs Rep <cls>, Rep 1, Rep 2, Rep <sep>, 
Rep 3, Rep 4, Rep 5, Rep <sep>; the label is shown at the top.] 


Fig. 15.6.2: Fine-tuning BERT for text pair classification or regression applications, such as natural 
language inference and semantic textual similarity. Suppose that the input text pair has two and 
three tokens. 


Compared with single text classification in Fig. 15.6.1, fine-tuning BERT for text pair 
classification in Fig. 15.6.2 differs in the input representation. For text pair regression tasks such as 
semantic textual similarity, only trivial changes are needed, such as outputting a continuous label 
value and using the mean squared loss, both of which are common for regression. 
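As a worked example of the regression loss, the following sketch computes the mean squared loss on the three similarity scores listed above. The predictions are made-up numbers, and `mse` is a plain definition written for this example rather than a library function.

```python
def mse(preds, targets):
    # Mean of the squared differences between predictions and targets
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

targets = [5.0, 3.0, 0.0]    # similarity scores of the three example pairs
preds = [4.5, 3.5, 1.0]      # hypothetical model outputs
loss = mse(preds, targets)   # (0.25 + 0.25 + 1.0) / 3 = 0.5
```

Some libraries scale the squared error by an extra constant such as 1/2; this changes the loss value but not the minimizer.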


15.6.3 Text Tagging 


Now let us consider token-level tasks, such as text tagging, where each token is assigned a label. 
Among text tagging tasks, part-of-speech tagging assigns each word a part-of-speech tag (e.g., 
adjective and determiner) according to the role of the word in the sentence. For example, according 
to the Penn Treebank II tag set, the sentence “John Smith 's car is new” should be tagged as “NNP 
(noun, proper singular) NNP POS (possessive ending) NN (noun, singular or mass) VBZ (verb, third 
person singular present) JJ (adjective)”. 
[Figure: the BERT representation of each token in the single text sequence is fed into a dense 
layer to output one label per token; the parameters of the dense layer are shared.] 


Fig. 15.6.3: Fine-tuning BERT for text tagging applications, such as part-of-speech tagging. Sup- 
pose that the input single text has six tokens. 


Fine-tuning BERT for text tagging applications is illustrated in Fig. 15.6.3. Compared with Fig. 
15.6.1, the only distinction is that in text tagging, the BERT representation of every token of the 
input text is fed into the same extra fully-connected layers to output the label of the token, such 
as a part-of-speech tag. 
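The difference from sequence-level classification is simply that the same layer is applied at every position. A minimal pure-Python sketch follows, assuming a toy tag set and made-up token representations and weights (none of these values come from a trained model):

```python
def dense(x, W, b):
    # Fully-connected layer: one output per row of W
    return [sum(w_k * x_k for w_k, x_k in zip(w, x)) + b_j
            for w, b_j in zip(W, b)]

def argmax(z):
    return max(range(len(z)), key=lambda k: z[k])

tags = ['NNP', 'NN', 'JJ']                      # toy tag set
encoded = [[1.0, 0.0],                          # made-up representation of token 1
           [0.0, 1.0],                          # token 2
           [-1.0, -1.0]]                        # token 3
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]      # hypothetical shared weights
b = [0.0, 0.0, 0.0]

# The same W and b are reused for every token position
predicted = [tags[argmax(dense(rep, W, b))] for rep in encoded]
```

Sharing `W` and `b` across positions keeps the number of extra parameters independent of the sequence length.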


15.6.4 Question Answering 


As another token-level application, question answering reflects capabilities of reading 
comprehension. For example, the Stanford Question Answering Dataset (SQuAD v1.1) consists of reading 
passages and questions, where the answer to every question is just a segment of text (text span) 
from the passage that the question is about (Rajpurkar et al., 2016). To explain, consider a passage 
“Some experts report that a mask's efficacy is inconclusive. However, mask makers insist that 
their products, such as N95 respirator masks, can guard against the virus.” and a question “Who 
says that N95 respirator masks can guard against the virus?”. The answer should be the text span 
“mask makers” in the passage. Thus, the goal in SQuAD v1.1 is to predict the start and end of the 
text span in the passage given a pair of question and passage. 
[Figure: the text pair “<cls> Token 1 Token 2 <sep> Token 3 Token 4 Token 5 <sep>” is fed into BERT, 
which outputs Rep <cls>, Rep 1, Rep 2, Rep <sep>, Rep 3, Rep 4, Rep 5, Rep <sep>; labels are output 
by dense layers whose parameters are shared for the same label value.] 


Fig. 15.6.4: Fine-tuning BERT for question answering. Suppose that the input text pair has two 
and three tokens. 


To fine-tune BERT for question answering, the question and passage are packed as the first and 
second text sequence, respectively, in the input of BERT. To predict the position of the start of 
the text span, the same additional fully-connected layer will transform the BERT representation 
of any token from the passage of position $i$ into a scalar score $s_i$. Such scores of all the passage 
tokens are further transformed by the softmax operation into a probability distribution, so that 
each token position $i$ in the passage is assigned a probability $p_i$ of being the start of the text span. 
Predicting the end of the text span is the same as above, except that parameters in its additional 
fully-connected layer are independent from those for predicting the start. When predicting the 
end, any passage token of position $i$ is transformed by the same fully-connected layer into a scalar 
score $e_i$. Fig. 15.6.4 depicts fine-tuning BERT for question answering. 


For question answering, the training objective of supervised learning is as straightforward as 
maximizing the log-likelihoods of the ground-truth start and end positions. When predicting the span, 
we can compute the score $s_i + e_j$ for a valid span from position $i$ to position $j$ ($i \leq j$), and output 
the span with the highest score. 
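The span selection rule and the training objective can be sketched in pure Python as follows. The start and end scores are made-up numbers for a four-token passage, and the function names are chosen for this example only.

```python
import math

def log_softmax(z):
    # Numerically stable log of the softmax probabilities
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]

def best_span(start_scores, end_scores):
    # Search all valid spans (i <= j) for the highest score s_i + e_j
    best, best_score = None, float('-inf')
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            if s + end_scores[j] > best_score:
                best, best_score = (i, j), s + end_scores[j]
    return best, best_score

def qa_loss(start_scores, end_scores, start, end):
    # Negative log-likelihood of the ground-truth start and end positions
    return -(log_softmax(start_scores)[start] + log_softmax(end_scores)[end])

s = [0.1, 2.0, 0.3, 0.2]   # hypothetical start scores s_i
e = [3.0, 0.5, 1.5, 0.1]   # hypothetical end scores e_j
span, score = best_span(s, e)
loss = qa_loss(s, e, start=1, end=2)
```

Taking the two argmaxes independently would give start 1 and end 0 here, which is not a valid span; the constraint $i \leq j$ selects the span from position 1 to position 2 instead.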


Summary 


+ BERT requires minimal architecture changes (extra fully-connected layers) for sequence- 
level and token-level natural language processing applications, such as single text classifi- 
cation (e.g., sentiment analysis and testing linguistic acceptability), text pair classification 
or regression (e.g., natural language inference and semantic textual similarity), text tagging 
(e.g., part-of-speech tagging), and question answering. 


+ During supervised learning of a downstream application, parameters of the extra layers are 
learned from scratch while all the parameters in the pretrained BERT model are fine-tuned. 


Exercises 


1. Let us design a search engine algorithm for news articles. When the system receives a 
query (e.g., “oil industry during the coronavirus outbreak”), it should return a ranked list 
of news articles that are most relevant to the query. Suppose that we have a huge pool of 
news articles and a large number of queries. To simplify the problem, suppose that the 
most relevant article has been labeled for each query. How can we apply negative sampling 
(see Section 14.2.1) and BERT in the algorithm design? 


2. How can we leverage BERT in training language models? 
3. Can we leverage BERT in machine translation? 




15.7 Natural Language Inference: Fine-Tuning BERT 


In earlier sections of this chapter, we have designed an attention-based architecture (in Section 
15.5) for the natural language inference task on the SNLI dataset (as described in Section 15.4). 
Now we revisit this task by fine-tuning BERT. As discussed in Section 15.6, natural language infer- 
ence is a sequence-level text pair classification problem, and fine-tuning BERT only requires an 
additional MLP-based architecture, as illustrated in Fig. 15.7.1. 


Fig. 15.7.1: This section feeds pretrained BERT to an MLP-based architecture for natural language 
inference. 


In this section, we will download a pretrained small version of BERT, then fine-tune it for natural 
language inference on the SNLI dataset. 


from d2l import mxnet as d2l
import json
import multiprocessing
from mxnet import gluon, np, npx
from mxnet.gluon import nn
import os

npx.set_np()


15.7.1 Loading Pretrained BERT 


We have explained how to pretrain BERT on the WikiText-2 dataset in Section 14.9 and Section 
14.10 (note that the original BERT model is pretrained on much bigger corpora). As discussed in 
Section 14.10, the original BERT model has hundreds of millions of parameters. In the following, 
we provide two versions of pretrained BERT: “bert.base” is about as big as the original BERT base 
model that requires a lot of computational resources to fine-tune, while “bert.small” is a small 
version to facilitate demonstration. 


d2l.DATA_HUB['bert.base'] = (d2l.DATA_URL + 'bert.base.zip',
                             '7b3820b35da691042e5d34c0971ac3edbd80d3f4')
d2l.DATA_HUB['bert.small'] = (d2l.DATA_URL + 'bert.small.zip',
                              'a4e718a47137ccd1809c9107ab4f5edd317bae2c')


Either pretrained BERT model contains a “vocab.json” file that defines the vocabulary set 
and a “pretrained.params” file of the pretrained parameters. We implement the following 
load_pretrained_model function to load pretrained BERT parameters. 


def load_pretrained_model(pretrained_model, num_hiddens, ffn_num_hiddens,
                          num_heads, num_layers, dropout, max_len, devices):
    data_dir = d2l.download_extract(pretrained_model)
    # Define an empty vocabulary to load the predefined vocabulary
    vocab = d2l.Vocab()
    vocab.idx_to_token = json.load(open(os.path.join(data_dir, 'vocab.json')))
    vocab.token_to_idx = {token: idx for idx, token in enumerate(
        vocab.idx_to_token)}
    bert = d2l.BERTModel(len(vocab), num_hiddens, ffn_num_hiddens, num_heads,
                         num_layers, dropout, max_len)
    # Load pretrained BERT parameters
    bert.load_parameters(os.path.join(data_dir, 'pretrained.params'),
                         ctx=devices)
    return bert, vocab


To facilitate demonstration on most machines, we will load and fine-tune the small version 
(“bert.small”) of the pretrained BERT in this section. In the exercises, we will show how to 
fine-tune the much larger “bert.base” to significantly improve the testing accuracy. 


devices = d2l.try_all_gpus()
bert, vocab = load_pretrained_model(
    'bert.small', num_hiddens=256, ffn_num_hiddens=512, num_heads=4,
    num_layers=2, dropout=0.1, max_len=512, devices=devices)


Downloading ../data/bert.small.zip from http://d2l-data.s3-accelerate.amazonaws.com/bert.small.zip...


15.7.2 The Dataset for Fine-Tuning BERT 


For the downstream task of natural language inference on the SNLI dataset, we define a customized 
dataset class SNLIBERTDataset. In each example, the premise and hypothesis form a pair of text 
sequences that is packed into one BERT input sequence as depicted in Fig. 15.6.2. Recall from Section 
14.8.4 that segment IDs are used to distinguish the premise and the hypothesis in a BERT input 
sequence. With the predefined maximum length of a BERT input sequence (max_len), the last token 
of the longer of the input text pair keeps getting removed until max_len is met. To accelerate 
generation of the SNLI dataset for fine-tuning BERT, we use 4 worker processes to generate training 
or testing examples in parallel. 


class SNLIBERTDataset(gluon.data.Dataset):
    def __init__(self, dataset, max_len, vocab=None):
        all_premise_hypothesis_tokens = [[
            p_tokens, h_tokens] for p_tokens, h_tokens in zip(
            *[d2l.tokenize([s.lower() for s in sentences])
              for sentences in dataset[:2]])]

        self.labels = np.array(dataset[2])
        self.vocab = vocab
        self.max_len = max_len
        (self.all_token_ids, self.all_segments,
         self.valid_lens) = self._preprocess(all_premise_hypothesis_tokens)
        print('read ' + str(len(self.all_token_ids)) + ' examples')

    def _preprocess(self, all_premise_hypothesis_tokens):
        pool = multiprocessing.Pool(4)  # Use 4 worker processes
        out = pool.map(self._mp_worker, all_premise_hypothesis_tokens)
        all_token_ids = [token_ids for token_ids, segments, valid_len in out]
        all_segments = [segments for token_ids, segments, valid_len in out]
        valid_lens = [valid_len for token_ids, segments, valid_len in out]
        return (np.array(all_token_ids, dtype='int32'),
                np.array(all_segments, dtype='int32'), np.array(valid_lens))

    def _mp_worker(self, premise_hypothesis_tokens):
        p_tokens, h_tokens = premise_hypothesis_tokens
        self._truncate_pair_of_tokens(p_tokens, h_tokens)
        tokens, segments = d2l.get_tokens_and_segments(p_tokens, h_tokens)
        token_ids = self.vocab[tokens] + [self.vocab['<pad>']] \
                             * (self.max_len - len(tokens))
        segments = segments + [0] * (self.max_len - len(segments))
        valid_len = len(tokens)
        return token_ids, segments, valid_len

    def _truncate_pair_of_tokens(self, p_tokens, h_tokens):
        # Reserve slots for '<CLS>', '<SEP>', and '<SEP>' tokens for the BERT
        # input
        while len(p_tokens) + len(h_tokens) > self.max_len - 3:
            if len(p_tokens) > len(h_tokens):
                p_tokens.pop()
            else:
                h_tokens.pop()

    def __getitem__(self, idx):
        return (self.all_token_ids[idx], self.all_segments[idx],
                self.valid_lens[idx]), self.labels[idx]

    def __len__(self):
        return len(self.all_token_ids)
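To see what truncation and packing produce for a single example, here is a standalone pure-Python sketch with a small max_len. The helper `pack` is written for this illustration and is assumed to mirror the token and segment layout of d2l.get_tokens_and_segments (a “<cls>” token, the first sequence, “<sep>”, the second sequence, “<sep>”); it works on token strings rather than vocabulary indices.

```python
def truncate_pair(p_tokens, h_tokens, max_len):
    # Reserve 3 slots for '<cls>', '<sep>', and '<sep>', then repeatedly
    # drop the last token of the longer sequence
    while len(p_tokens) + len(h_tokens) > max_len - 3:
        if len(p_tokens) > len(h_tokens):
            p_tokens.pop()
        else:
            h_tokens.pop()

def pack(p_tokens, h_tokens, max_len):
    # Assumed layout: <cls> premise <sep> hypothesis <sep>, then padding
    tokens = ['<cls>'] + p_tokens + ['<sep>'] + h_tokens + ['<sep>']
    segments = [0] * (len(p_tokens) + 2) + [1] * (len(h_tokens) + 1)
    valid_len = len(tokens)
    tokens = tokens + ['<pad>'] * (max_len - valid_len)
    segments = segments + [0] * (max_len - valid_len)
    return tokens, segments, valid_len

# A pair that is too long for max_len = 10 (6 + 3 tokens > 10 - 3)
p = ['a', 'man', 'is', 'walking', 'a', 'dog']
h = ['a', 'dog', 'runs']
truncate_pair(p, h, max_len=10)
tokens, segments, valid_len = pack(p, h, max_len=10)

# A short pair that needs padding instead
p2, h2 = ['he', 'is', 'good'], ['he', 'is', 'bad']
truncate_pair(p2, h2, max_len=10)
tokens2, segments2, valid_len2 = pack(p2, h2, max_len=10)
```

In the first case only the premise is shortened because it stays the longer sequence until the budget is met; in the second case `valid_len2` records how many positions hold real tokens so that the padding can be masked.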


After downloading the SNLI dataset, we generate training and testing examples by instantiating 
the SNLIBERTDataset class. Such examples will be read in minibatches during training and testing 
of natural language inference. 


# Reduce `batch_size` if there is an out of memory error. In the original BERT
# model, `max_len` = 512
batch_size, max_len, num_workers = 512, 128, d2l.get_dataloader_workers()
data_dir = d2l.download_extract('SNLI')
train_set = SNLIBERTDataset(d2l.read_snli(data_dir, True), max_len, vocab)
test_set = SNLIBERTDataset(d2l.read_snli(data_dir, False), max_len, vocab)
train_iter = gluon.data.DataLoader(train_set, batch_size, shuffle=True,
                                   num_workers=num_workers)
test_iter = gluon.data.DataLoader(test_set, batch_size,
                                  num_workers=num_workers)


read 549367 examples 
read 9824 examples 


15.7.3 Fine-Tuning BERT 


As Fig. 15.6.2 indicates, fine-tuning BERT for natural language inference requires only an extra 
MLP consisting of two fully-connected layers (see self.hidden and self.output in the following 
BERTClassifier class). This MLP transforms the BERT representation of the special “<cls>” 
token, which encodes the information of both the premise and the hypothesis, into three outputs of 
natural language inference: entailment, contradiction, and neutral. 


class BERTClassifier(nn.Block):
    def __init__(self, bert):
        super(BERTClassifier, self).__init__()
        self.encoder = bert.encoder
        self.hidden = bert.hidden
        self.output = nn.Dense(3)

    def forward(self, inputs):
        tokens_X, segments_X, valid_lens_x = inputs
        encoded_X = self.encoder(tokens_X, segments_X, valid_lens_x)
        # Use only the representation of the '<cls>' token (position 0)
        return self.output(self.hidden(encoded_X[:, 0, :]))


In the following, the pretrained BERT model bert is fed into the BERTClassifier instance net for 
the downstream application. In common implementations of BERT fine-tuning, only the 
parameters of the output layer of the additional MLP (net.output) will be learned from scratch. All the 
parameters of the pretrained BERT encoder (net.encoder) and the hidden layer of the additional 
MLP (net.hidden) will be fine-tuned. 


net = BERTClassifier(bert)
net.output.initialize(ctx=devices)


Recall that in Section 14.8 both the MaskLM class and the NextSentencePred class have parameters 
in their employed MLPs. These parameters are part of those in the pretrained BERT model bert, 
and thus part of parameters in net. However, such parameters are only for computing the masked 
language modeling loss and the next sentence prediction loss during pretraining. These two loss 
functions are irrelevant to fine-tuning downstream applications. Thus, the parameters of the 
employed MLPs in MaskLM and NextSentencePred are not updated (they become stale) when BERT is 
fine-tuned. 


To allow parameters with stale gradients, the flag ignore_stale_grad=True is set in the step 
function of d2l.train_batch_ch13. We use this function to train and evaluate the model net using the 
training set (train_iter) and the testing set (test_iter) of SNLI. Because of limited computational 
resources, the training and testing accuracy reported here leave room for improvement; we leave 
that discussion to the exercises. 


lr, num_epochs = 1e-4, 5
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices,
               d2l.split_batch_multi_inputs)


loss 0.476, train acc 0.811, test acc 0.783 
8270.1 examples/sec on [gpu(0), gpu(1)] 


[Figure: train loss, train acc, and test acc plotted over the course of training.] 

Summary 


+ We can fine-tune the pretrained BERT model for downstream applications, such as natural 
language inference on the SNLI dataset. 


+ During fine-tuning, the BERT model becomes part of the model for the downstream 
application. Parameters that are only related to pretraining loss will not be updated during 
fine-tuning. 


Exercises 


1. Fine-tune a much larger pretrained BERT model that is about as big as the original BERT base 
model if your computational resource allows. Set arguments in the load_pretrained_model 
function as: replacing 'bert.small' with 'bert.base', increasing values of num_hiddens=256, 
ffn_num_hiddens=512, num_heads=4, num_layers=2 to 768, 3072, 12, 12, respectively. By 
increasing fine-tuning epochs (and possibly tuning other hyperparameters), can you get a 
testing accuracy higher than 0.86? 


2. How to truncate a pair of sequences according to their ratio of length? Compare this pair 
truncation method and the one used in the SNLIBERTDataset class. What are their pros and 
cons? 


16 Recommender Systems 


Shuai Zhang (Amazon), Aston Zhang (Amazon), and Yi Tay (Google) 


Recommender systems are widely employed in industry and are ubiquitous in our daily lives. 
These systems are utilized in a number of areas such as online shopping sites (e.g., amazon.com), 
music/movie service sites (e.g., Netflix and Spotify), mobile application stores (e.g., the iOS App 
Store and Google Play), and online advertising, just to name a few. 


The major goal of recommender systems is to help users discover relevant items such as movies 
to watch, text to read, or products to buy, so as to create a delightful user experience. Moreover, 
recommender systems are among the most powerful machine learning systems that online 
retailers implement in order to drive incremental revenue. Recommender systems can stand in for 
search engines by reducing the effort of proactive searches and surprising users with offers 
they never searched for. Many companies have managed to position themselves ahead of their 
competitors with the help of more effective recommender systems. As such, recommender systems 
are not only central to our everyday lives but also indispensable in some industries. 


In this chapter, we will cover the fundamentals and advancements of recommender systems, 
and explore some common techniques for building recommender systems with different available 
data sources, along with their implementations. Specifically, you will learn how 
to predict the rating a user might give to a prospective item, how to generate a recommendation 
list of items, and how to predict the click-through rate from abundant features. These tasks are 
commonplace in real-world applications. By studying this chapter, you will get hands-on 
experience in solving real-world recommendation problems with not only classical methods 
but also more advanced deep learning based models. 


16.1 Overview of Recommender Systems 


In the last decade, the Internet has evolved into a platform for large-scale online services, which 
has profoundly changed the way we communicate, read news, buy products, and watch movies. 
Meanwhile, the unprecedented number of items (we use the term item to refer to movies, 
news, books, and products) offered online requires a system that can help us discover the items that 
we prefer. Recommender systems are therefore powerful information filtering tools that can 
facilitate personalized services and provide tailored experiences to individual users. In short, 
recommender systems play a pivotal role in utilizing the wealth of data available to make choices 
manageable. Nowadays, recommender systems are at the core of a number of online service 
providers such as Amazon, Netflix, and YouTube. Recall the example of deep learning books 
recommended by Amazon in Fig. 1.3.3. The benefits of employing recommender systems are 
twofold: on the one hand, they can largely reduce users' effort in finding items and alleviate the issue of 
information overload; on the other hand, they can add business value to online service providers and 
are an important source of revenue. This chapter will introduce the fundamental concepts, classic 
models, and recent advances with deep learning in the field of recommender systems, together 
with implemented examples. 


[Figure: a user interacts with an application (e.g., on a smartphone); user feedback such as buy, 
like, watch, and dislike is used to produce a recommendation list.] 

Fig. 16.1.1: Illustration of the Recommendation Process 


16.1.1 Collaborative Filtering 


We start the journey with an important concept in recommender systems, collaborative filtering 
(CF), a term first coined by the Tapestry system (Goldberg et al., 1992), referring to “people 
collaborate to help one another perform the filtering process in order to handle the large amounts 
of email and messages posted to newsgroups”. This term has since been enriched with more senses. In 
a broad sense, it is the process of filtering for information or patterns using techniques involving 
collaboration among multiple users, agents, and data sources. CF has many forms, and numerous 
CF methods have been proposed since its advent. 


Overall, CF techniques can be categorized into memory-based CF, model-based CF, and their
hybrid (Su & Khoshgoftaar, 2009). Representative memory-based CF techniques are
nearest-neighbor-based approaches such as user-based CF and item-based CF (Sarwar et al., 2001).
Latent factor models such as matrix factorization are examples of model-based CF. Memory-based
CF has limitations in dealing with sparse and large-scale data since it computes similarity values
based on common items. Model-based methods have become more popular owing to their better
capability in dealing with sparsity and scalability. Many model-based CF approaches can be
extended with neural networks, leading to more flexible and scalable models that benefit from the
computational acceleration available in deep learning (Zhang et al., 2019). In general, CF only
uses the user-item interaction data to make predictions and recommendations. Besides CF,
content-based and context-based recommender systems are also useful in incorporating the content
descriptions of items/users and contextual signals such as timestamps and locations. Naturally,
we may need to adjust the model types/structures when different input data is available.
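

To make the memory-based idea concrete, here is a minimal sketch of user-based CF in plain NumPy. It is an illustration rather than a reference implementation: the toy rating matrix, the function names, and the choice of cosine similarity restricted to co-rated items are all assumptions made for this example. It also shows why sparsity hurts, since two users without common items have no usable similarity.

```python
import numpy as np

def cosine_on_common_items(a, b):
    """Cosine similarity restricted to items rated by both users (0 = unrated)."""
    common = (a > 0) & (b > 0)
    if not common.any():
        return 0.0  # no overlap: similarity is undefined, fall back to 0
    x, y = a[common], b[common]
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def predict_user_based(ratings, user, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    num, den = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue  # skip the user itself and users who did not rate the item
        sim = cosine_on_common_items(ratings[user], ratings[other])
        num += sim * ratings[other, item]
        den += abs(sim)
    return num / den if den > 0 else 0.0

# Hypothetical toy data: rows are users, columns are items, 0 means "not rated".
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 3, 1],
                    [1, 2, 5, 0]], dtype=float)
pred = predict_user_based(ratings, user=0, item=2)
print(f'predicted rating of user 0 for item 2: {pred:.3f}')
```

The prediction lies between the two neighbors' ratings for that item (3 and 5), pulled toward the more similar neighbor.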


16.1.2 Explicit Feedback and Implicit Feedback


To learn the preferences of users, the system needs to collect feedback from them. The feedback can
be either explicit or implicit (Hu et al., 2008). For example, IMDB [217] collects star ratings ranging
from one to ten stars for movies. YouTube provides thumbs-up and thumbs-down buttons for
users to show their preferences. Gathering explicit feedback requires users to indicate their
interests proactively. As a result, explicit feedback is not always readily available, as many
users may be reluctant to rate products. By comparison, implicit feedback is often readily
available since it is mainly concerned with modeling implicit behavior such as user clicks.
As such, many recommender systems are centered on implicit feedback, which indirectly reflects
users' opinions through observed behavior. There are diverse forms of implicit feedback,
including purchase history, browsing history, watches, and even mouse movements. For example,
a user who purchased many books by the same author probably likes that author. Note that implicit
feedback is inherently noisy. We can only guess users' preferences and true motives. The fact that
a user watched a movie does not necessarily indicate a positive view of that movie.
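

The difference between the two kinds of feedback can be sketched in a few lines of plain Python. The records below are hypothetical and only serve as an illustration: dropping the rating value turns explicit feedback into implicit feedback, and the low rating of user 1 for item 12 then looks like any other interaction, which is exactly the kind of noise mentioned above.

```python
from collections import defaultdict

# Hypothetical explicit feedback: (user_id, item_id, rating on a 1-5 scale).
explicit = [(0, 10, 5), (0, 11, 2), (1, 10, 4), (1, 12, 1)]

# Implicit view: keep only the fact that an interaction happened.
implicit = defaultdict(set)
for user, item, _rating in explicit:
    implicit[user].add(item)

for user in sorted(implicit):
    print(user, sorted(implicit[user]))
```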


16.1.3 Recommendation Tasks 


A number of recommendation tasks have been investigated in the past decades. Based on the
application domain, there are movie recommendation, news recommendation, point-of-interest
recommendation (Ye et al., 2011), and so forth. It is also possible to differentiate the tasks
based on the types of feedback and input data. For example, the rating prediction task aims to
predict explicit ratings. Top-n recommendation (item ranking) ranks all items for each user
personally based on implicit feedback. If timestamp information is also included, we can
build sequence-aware recommendation (Quadrana et al., 2018). Another popular task is
click-through rate prediction, which is also based on implicit feedback but can utilize various
categorical features. Recommending for new users and recommending new items to existing
users are called cold-start recommendation (Schein et al., 2002).


Summary 
• Recommender systems are important for individual users and industries. Collaborative
filtering is a key concept in recommendation.


• There are two types of feedback: implicit feedback and explicit feedback. A number of
recommendation tasks have been explored during the last decade.


Exercises 


1. Can you explain how recommender systems influence your daily life? 
2. What interesting recommendation tasks do you think can be investigated? 


Discussions [218]


[217] https://www.imdb.com/
[218] https://discuss.d2l.ai/t/398


16.2 The MovieLens Dataset


There are a number of datasets that are available for recommendation research. Amongst
them, the MovieLens [219] dataset is probably one of the more popular ones. MovieLens is a
non-commercial web-based movie recommender system. It was created in 1997 and is run by GroupLens,
a research lab at the University of Minnesota, in order to gather movie rating data for research
purposes. MovieLens data has been critical for several research studies including personalized
recommendation and social psychology.


16.2.1 Getting the Data 


The MovieLens dataset is hosted on the GroupLens [220] website. Several versions are available. We
will use the MovieLens 100K dataset (Herlocker et al., 1999). This dataset comprises 100,000
ratings, ranging from 1 to 5 stars, from 943 users on 1682 movies. It has been cleaned up so that
each user has rated at least 20 movies. Some simple demographic information such as age and gender
for the users, and genres for the items, is also available. We can download ml-100k.zip [221] and
extract the u.data file, which contains all 100,000 ratings in a tab-separated, CSV-like format.
There are many other files in the folder; a detailed description of each file can be found in the
README [222] file of the dataset.


To begin with, let us import the packages required to run this section’s experiments. 


from d2l import mxnet as d2l
from mxnet import gluon, np
import os
import pandas as pd


Then, we download the MovieLens 100k dataset and load the interactions as a DataFrame.


#@save
d2l.DATA_HUB['ml-100k'] = (
    'http://files.grouplens.org/datasets/movielens/ml-100k.zip',
    'cd4dcac4241c8a4ad7badc7ca635da8a69dddb83')


#@save
def read_data_ml100k():
    data_dir = d2l.download_extract('ml-100k')
    # Column names of the tab-separated u.data file
    names = ['user_id', 'item_id', 'rating', 'timestamp']
    data = pd.read_csv(os.path.join(data_dir, 'u.data'), sep='\t', names=names,
                       engine='python')
    num_users = data.user_id.unique().shape[0]
    num_items = data.item_id.unique().shape[0]
    return data, num_users, num_items





[219] https://movielens.org/
[220] https://grouplens.org/datasets/movielens/
[221] http://files.grouplens.org/datasets/movielens/ml-100k.zip
[222] http://files.grouplens.org/datasets/movielens/ml-100k-README.txt


16.2.2 Statistics of the Dataset


Let us load up the data and inspect the first five records manually. It is an effective way to learn
the data structure and verify that the data has been loaded properly.


data, num_users, num_items = read_data_ml100k()
sparsity = 1 - len(data) / (num_users * num_items)
print(f'number of users: {num_users}, number of items: {num_items}')
print(f'matrix sparsity: {sparsity:f}')
print(data.head(5))


number of users: 943, number of items: 1682
matrix sparsity: 0.936953
   user_id  item_id  rating  timestamp
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
3      244       51       2  880606923
4      166      346       1  886397596


We can see that each line consists of four columns: “user id” (1-943), “item id” (1-1682),
“rating” (1-5), and “timestamp”. We can construct an interaction matrix of size n × m, where n and
m are the number of users and the number of items, respectively. This dataset only records the
existing ratings, so we can also call it a rating matrix, and we will use interaction matrix and
rating matrix interchangeably when the values of this matrix represent exact ratings. Most of the
values in the rating matrix are unknown, as users have not rated the majority of movies. We also
show the sparsity of this dataset. The sparsity is defined as 1 - number of nonzero entries /
(number of users * number of items). Clearly, the interaction matrix is extremely sparse (i.e.,
sparsity = 93.695%). Real-world datasets may suffer from even greater sparsity, which has been
a long-standing challenge in building recommender systems. A viable solution is to use additional
side information such as user/item features to alleviate the sparsity.
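

As a quick sanity check of the definition, the sparsity can be recomputed from the three dataset statistics quoted in this section (943 users, 1682 items, 100,000 ratings) without downloading anything. The variable names are chosen for this sketch only.

```python
# Dataset statistics as stated in the text for MovieLens 100K.
num_users, num_items, num_ratings = 943, 1682, 100_000

# sparsity = 1 - (number of observed entries) / (size of the full matrix)
sparsity = 1 - num_ratings / (num_users * num_items)
print(f'matrix sparsity: {sparsity:f}')
```

Only about 6.3% of the 1,586,126 possible user-item pairs carry a rating.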


We then plot the distribution of the count of different ratings. As expected, the histogram looks
roughly bell-shaped, with most ratings centered at 3-4.


d2l.plt.hist(data['rating'], bins=5, ec='black')
d2l.plt.xlabel('Rating')
d2l.plt.ylabel('Count')
d2l.plt.title('Distribution of Ratings in MovieLens 100K')
d2l.plt.show()


16.2.3 Splitting the dataset


We split the dataset into training and test sets. The following function provides two split modes:
random and seq-aware. In the random mode, the function splits the 100k interactions
randomly without considering timestamps and, by default, uses 90% of the data as training samples
and the remaining 10% as test samples. In the seq-aware mode, we hold out the item that a user rated
most recently for testing and use the user's historical interactions as the training set. User
historical interactions are sorted from oldest to newest based on timestamp. This mode will be used
in the sequence-aware recommendation section.


#@save
def split_data_ml100k(data, num_users, num_items,
                      split_mode='random', test_ratio=0.1):
    """Split the dataset in random mode or seq-aware mode."""
    if split_mode == 'seq-aware':
        train_items, test_items, train_list = {}, {}, []
        for line in data.itertuples():
            u, i, rating, time = line[1], line[2], line[3], line[4]
            train_items.setdefault(u, []).append((u, i, rating, time))
            # Keep the most recent interaction of each user for testing
            if u not in test_items or test_items[u][-1] < time:
                test_items[u] = (i, rating, time)
        # Sort each user's history from oldest to newest
        for u in range(1, num_users + 1):
            train_list.extend(sorted(train_items[u], key=lambda k: k[3]))
        test_data = [(key, *value) for key, value in test_items.items()]
        # Remove the held-out interactions from the training list
        train_data = [item for item in train_list if item not in test_data]
        train_data = pd.DataFrame(train_data)
        test_data = pd.DataFrame(test_data)
    else:
        # Each row becomes a training sample with probability 1 - test_ratio
        mask = [True if x == 1 else False for x in np.random.uniform(
            0, 1, (len(data))) < 1 - test_ratio]
        neg_mask = [not x for x in mask]
        train_data, test_data = data[mask], data[neg_mask]
    return train_data, test_data


Note that it is good practice to use a validation set in addition to a test set. However,
we omit that for the sake of brevity. In this case, our test set can be regarded as our held-out
validation set.
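

If a separate validation set is wanted, one possible approach is to split row indices three ways. The sketch below is not part of the functions saved in this section; the helper name `three_way_split`, its ratios, and the fixed seed are assumptions made for illustration, and it uses standard NumPy rather than the MXNet `np` module.

```python
import numpy as np

def three_way_split(num_samples, val_ratio=0.1, test_ratio=0.1, seed=0):
    """Return disjoint index arrays for the training, validation and test sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_samples)      # shuffle all row indices once
    num_test = int(num_samples * test_ratio)
    num_val = int(num_samples * val_ratio)
    test_idx = perm[:num_test]
    val_idx = perm[num_test:num_test + num_val]
    train_idx = perm[num_test + num_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100_000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a pandas DataFrame, the three parts could then be selected with `data.iloc[train_idx]` and so on; hyperparameters would be tuned on the validation part and the test part touched only once at the end.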


16.2.4 Loading the data 


After splitting the dataset, we convert the training set and test set into lists and
dictionaries/matrices for convenience. The following function reads the dataframe line by line and
re-indexes the users/items so that indices start from zero. The function then returns lists of
users, items, and ratings, together with a dictionary/matrix that records the interactions. We can
specify the type of feedback as either explicit or implicit.


#@save
def load_data_ml100k(data, num_users, num_items, feedback='explicit'):
    users, items, scores = [], [], []
    inter = np.zeros((num_items, num_users)) if feedback == 'explicit' else {}
    for line in data.itertuples():
        user_index, item_index = int(line[1] - 1), int(line[2] - 1)
        score = int(line[3]) if feedback == 'explicit' else 1
        users.append(user_index)
        items.append(item_index)
        scores.append(score)
        if feedback == 'implicit':
            inter.setdefault(user_index, []).append(item_index)
        else:
            inter[item_index, user_index] = score
    return users, items, scores, inter


Afterwards, we put the above steps together into a function that will be used in the next section.
The results are wrapped with Dataset and DataLoader. Note that the last_batch of the DataLoader for
training data is set to the rollover mode (the remaining samples are rolled over to the next epoch)
and the order is shuffled.


#@save
def split_and_load_ml100k(split_mode='seq-aware', feedback='explicit',
                          test_ratio=0.1, batch_size=256):
    data, num_users, num_items = read_data_ml100k()
    train_data, test_data = split_data_ml100k(
        data, num_users, num_items, split_mode, test_ratio)
    train_u, train_i, train_r, _ = load_data_ml100k(
        train_data, num_users, num_items, feedback)
    test_u, test_i, test_r, _ = load_data_ml100k(
        test_data, num_users, num_items, feedback)
    train_set = gluon.data.ArrayDataset(
        np.array(train_u), np.array(train_i), np.array(train_r))
    test_set = gluon.data.ArrayDataset(
        np.array(test_u), np.array(test_i), np.array(test_r))
    train_iter = gluon.data.DataLoader(
        train_set, shuffle=True, last_batch='rollover',
        batch_size=batch_size)
    test_iter = gluon.data.DataLoader(
        test_set, batch_size=batch_size)
    return num_users, num_items, train_iter, test_iter


Summary 


• MovieLens datasets are widely used for recommendation research. They are publicly available
and free to use.


• We define functions to download and preprocess the MovieLens 100k dataset for further use
in later sections.


Exercises 


• What other similar recommendation datasets can you find?


• Go through the https://movielens.org/ site for more information about MovieLens.


Discussions [223]


16.3 Matrix Factorization 


Matrix Factorization (Koren et al., 2009) is a well-established algorithm in the recommender
systems literature. The first version of the matrix factorization model was proposed by Simon Funk
in a famous blog post [224], in which he described the idea of factorizing the interaction matrix.
It then became widely known due to the Netflix contest, which was launched in 2006. At that time,
Netflix, a media-streaming and video-rental company, announced a contest to improve its recommender
system performance. The best team that could improve on the Netflix baseline (i.e., Cinematch)
by 10 percent would win a one million USD prize. As such, this contest attracted a lot of attention
to the field of recommender system research. Subsequently, the grand prize was won by the
BellKor's Pragmatic Chaos team, a combined team of BellKor, Pragmatic Theory, and BigChaos (you
do not need to worry about these algorithms now). Although the final score was the result of an
ensemble solution (i.e., a combination of many algorithms), the matrix factorization algorithm
played a critical role in the final blend. The technical report of the Netflix Grand Prize solution
(Töscher et al., 2009) provides a detailed introduction to the adopted model. In this section, we
will dive into the details of the matrix factorization model and its implementation.





[223] https://discuss.d2l.ai/t/399
[224] https://sifter.org/~simon/journal/20061211.html


16.3.1 The Matrix Factorization Model 


Matrix factorization is a class of collaborative filtering models. Specifically, the model factorizes 
the user-item interaction matrix (e.g., rating matrix) into the product of two lower-rank matrices, 
capturing the low-rank structure of the user-item interactions. 


Let $\mathbf{R} \in \mathbb{R}^{m \times n}$ denote the interaction matrix with $m$ users and $n$
items, where the values of $\mathbf{R}$ represent explicit ratings. The user-item interaction will
be factorized into a user latent matrix $\mathbf{P} \in \mathbb{R}^{m \times k}$ and an item latent
matrix $\mathbf{Q} \in \mathbb{R}^{n \times k}$, where $k \ll m, n$ is the latent factor size. Let
$\mathbf{p}_u$ denote the $u^\mathrm{th}$ row of $\mathbf{P}$ and $\mathbf{q}_i$ denote the
$i^\mathrm{th}$ row of $\mathbf{Q}$. For a given item $i$, the elements of $\mathbf{q}_i$ measure
the extent to which the item possesses certain characteristics, such as the genres and languages
of a movie. For a given user $u$, the elements of $\mathbf{p}_u$ measure the extent of interest the
user has in the items' corresponding characteristics. These latent factors might measure obvious
dimensions as mentioned in those examples or be completely uninterpretable. The predicted ratings
can be estimated by


$$\hat{\mathbf{R}} = \mathbf{P}\mathbf{Q}^\top \qquad (16.3.1)$$


where $\hat{\mathbf{R}} \in \mathbb{R}^{m \times n}$ is the predicted rating matrix, which has the
same shape as $\mathbf{R}$. One major problem of this prediction rule is that user/item biases
cannot be modeled. For example, some users tend to give higher ratings, or some items always get
lower ratings due to poorer quality. These biases are commonplace in real-world applications. To
capture these biases, user-specific and item-specific bias terms are introduced. Specifically, the
predicted rating user $u$ gives to item $i$ is calculated by


$$\hat{\mathbf{R}}_{ui} = \mathbf{p}_u \mathbf{q}_i^\top + b_u + b_i \qquad (16.3.2)$$


Then, we train the matrix factorization model by minimizing the mean squared error between
predicted rating scores and real rating scores. The objective function is defined as follows:


$$\underset{\mathbf{P}, \mathbf{Q}, b}{\mathrm{argmin}} \sum_{(u, i) \in \mathcal{K}} \| \mathbf{R}_{ui} - \hat{\mathbf{R}}_{ui} \|^2 + \lambda (\| \mathbf{P} \|^2_F + \| \mathbf{Q} \|^2_F + b_u^2 + b_i^2) \qquad (16.3.3)$$


where $\lambda$ denotes the regularization rate. The regularizing term
$\lambda (\| \mathbf{P} \|^2_F + \| \mathbf{Q} \|^2_F + b_u^2 + b_i^2)$ is used to avoid
over-fitting by penalizing the magnitude of the parameters. The $(u, i)$ pairs for which
$\mathbf{R}_{ui}$ is known are stored in the set
$\mathcal{K} = \{(u, i) \mid \mathbf{R}_{ui} \text{ is known}\}$. The model parameters can be
learned with an optimization algorithm, such as stochastic gradient descent or Adam.
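

Before turning to the Gluon implementation, the objective can be illustrated with a minimal NumPy sketch that applies plain stochastic gradient descent to a handful of observed ratings. The toy data, the function names `train_mf_sgd` and `predict`, and all hyperparameter values are assumptions chosen for this illustration, not the settings used later in this section.

```python
import numpy as np

def train_mf_sgd(triples, num_users, num_items, k=2, lr=0.05, lam=0.01,
                 num_epochs=200, seed=0):
    """Fit P, Q, b_u, b_i on observed (u, i, rating) triples with plain SGD."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (num_users, k))
    Q = rng.normal(0, 0.1, (num_items, k))
    b_u, b_i = np.zeros(num_users), np.zeros(num_items)
    for _ in range(num_epochs):
        for idx in rng.permutation(len(triples)):
            u, i, r = triples[idx]
            err = r - (P[u] @ Q[i] + b_u[u] + b_i[i])  # R_ui - R_hat_ui
            p_u = P[u].copy()
            # Gradient steps on err^2 + lam * (|p_u|^2 + |q_i|^2 + b_u^2 + b_i^2),
            # with the constant factor 2 absorbed into the learning rate.
            P[u] += lr * (err * Q[i] - lam * P[u])
            Q[i] += lr * (err * p_u - lam * Q[i])
            b_u[u] += lr * (err - lam * b_u[u])
            b_i[i] += lr * (err - lam * b_i[i])
    return P, Q, b_u, b_i

def predict(P, Q, b_u, b_i, u, i):
    """Predicted rating p_u . q_i + b_u + b_i."""
    return float(P[u] @ Q[i] + b_u[u] + b_i[i])

# Observed entries of a hypothetical 3 x 3 rating matrix (the set K in the text).
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q, b_u, b_i = train_mf_sgd(triples, num_users=3, num_items=3)
train_rmse = np.sqrt(np.mean([(r - predict(P, Q, b_u, b_i, u, i)) ** 2
                              for u, i, r in triples]))
print(f'training RMSE on the observed entries: {train_rmse:.3f}')
```

Only the observed pairs contribute to the updates; the unobserved entries of the matrix are never touched during training but can be predicted afterwards with `predict`.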


An intuitive illustration of the matrix factorization model is shown below:


Fig. 16.3.1: Illustration of matrix factorization model


In the rest of this section, we will explain the implementation of matrix factorization and train the
model on the MovieLens dataset.


from d2l import mxnet as d2l
from mxnet import autograd, gluon, np, npx
from mxnet.gluon import nn
import mxnet as mx

npx.set_np()


16.3.2 Model Implementation 


First, we implement the matrix factorization model described above. The user and item latent
factors can be created with nn.Embedding. The input_dim is the number of items/users and
the output_dim is the dimension of the latent factors (k). We can also use nn.Embedding to create
the user/item biases by setting the output_dim to one. In the forward function, user and item ids
are used to look up the embeddings.


class MF(nn.Block):
    def __init__(self, num_factors, num_users, num_items, **kwargs):
        super(MF, self).__init__(**kwargs)
        self.P = nn.Embedding(input_dim=num_users, output_dim=num_factors)
        self.Q = nn.Embedding(input_dim=num_items, output_dim=num_factors)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)

    def forward(self, user_id, item_id):
        P_u = self.P(user_id)
        Q_i = self.Q(item_id)
        b_u = self.user_bias(user_id)
        b_i = self.item_bias(item_id)
        outputs = (P_u * Q_i).sum(axis=1) + np.squeeze(b_u) + np.squeeze(b_i)
        return outputs.flatten()


16.3.3 Evaluation Measures 


We then implement the RMSE (root-mean-square error) measure, which is commonly used to 
measure the differences between rating scores predicted by the model and the actually observed 
ratings (ground truth) (Gunawardana & Shani, 2015). RMSE is defined as:


$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|}\sum_{(u, i) \in \mathcal{T}}(\mathbf{R}_{ui} - \hat{\mathbf{R}}_{ui})^2} \qquad (16.3.4)$$


where $\mathcal{T}$ is the set consisting of pairs of users and items that you want to evaluate on.
$|\mathcal{T}|$ is the size of this set. We can use the RMSE function provided by mx.metric.


def evaluator(net, test_iter, devices):
    rmse = mx.metric.RMSE()  # Get the RMSE
    rmse_list = []
    for idx, (users, items, ratings) in enumerate(test_iter):
        u = gluon.utils.split_and_load(users, devices, even_split=False)
        i = gluon.utils.split_and_load(items, devices, even_split=False)
        r_ui = gluon.utils.split_and_load(ratings, devices, even_split=False)
        r_hat = [net(u, i) for u, i in zip(u, i)]
        rmse.update(labels=r_ui, preds=r_hat)
        rmse_list.append(rmse.get()[1])
    return float(np.mean(np.array(rmse_list)))
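

To check the RMSE definition by hand, the following standalone sketch computes it with standard NumPy on four made-up rating pairs. The helper name `rmse` and the numbers are assumptions for illustration and are independent of the mx.metric implementation.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

observed = [4, 3, 5, 1]
predicted = [3.5, 3.0, 4.0, 2.0]
# Squared errors are 0.25, 0, 1 and 1; their mean is 0.5625, whose root is 0.75.
print(rmse(observed, predicted))
```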


16.3.4 Training and Evaluating the Model 


In the training function, we adopt the $L_2$ loss with weight decay. The weight decay mechanism
has the same effect as the $L_2$ regularization.


#@save
def train_recsys_rating(net, train_iter, test_iter, loss, trainer, num_epochs,
                        devices=d2l.try_all_gpus(), evaluator=None,
                        **kwargs):
    timer = d2l.Timer()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0, 2],
                            legend=['train loss', 'test RMSE'])
    for epoch in range(num_epochs):
        metric, l = d2l.Accumulator(3), 0.
        for i, values in enumerate(train_iter):
            timer.start()
            input_data = []
            values = values if isinstance(values, list) else [values]
            for v in values:
                input_data.append(gluon.utils.split_and_load(v, devices))
            train_feat = input_data[0:-1] if len(values) > 1 else input_data
            train_label = input_data[-1]
            with autograd.record():
                preds = [net(*t) for t in zip(*train_feat)]
                ls = [loss(p, s) for p, s in zip(preds, train_label)]
            [l.backward() for l in ls]
            l += sum([l.asnumpy() for l in ls]).mean() / len(devices)
            trainer.step(values[0].shape[0])
            metric.add(l, values[0].shape[0], values[0].size)
            timer.stop()
        if len(kwargs) > 0:  # It will be used in section AutoRec
            test_rmse = evaluator(net, test_iter, kwargs['inter_mat'],
                                  devices)
        else:
            test_rmse = evaluator(net, test_iter, devices)
        train_l = l / (i + 1)
        animator.add(epoch + 1, (train_l, test_rmse))
    print(f'train loss {metric[0] / metric[1]:.3f}, '
          f'test RMSE {test_rmse:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(devices)}')
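

To see why weight decay acts like $L_2$ regularization, consider one step of plain SGD. The sketch below, written with standard NumPy on made-up numbers, compares (a) shrinking the weights by a factor and then following the data gradient with (b) following the gradient of a loss that carries an explicit squared-norm penalty. The names and values are assumptions for illustration; for adaptive optimizers, whether the two coincide depends on how the decay is applied.

```python
import numpy as np

def grad_data_loss(w, x, y):
    """Gradient of the squared-error loss 0.5 * (w @ x - y)^2 with respect to w."""
    return (w @ x - y) * x

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])
y, lr, wd = 1.0, 0.1, 0.01

# (a) Weight decay: multiply the weights by (1 - lr * wd), then take the data step.
w_decay = (1 - lr * wd) * w - lr * grad_data_loss(w, x, y)

# (b) Explicit penalty: the loss 0.5 * (w @ x - y)^2 + 0.5 * wd * ||w||^2
#     has gradient grad_data_loss + wd * w.
w_l2 = w - lr * (grad_data_loss(w, x, y) + wd * w)

print(np.allclose(w_decay, w_l2))
```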


Finally, let us put everything together and train the model. Here, we set the latent factor
dimension to 30.


devices = d2l.try_all_gpus()
num_users, num_items, train_iter, test_iter = d2l.split_and_load_ml100k(
    test_ratio=0.1, batch_size=512)
net = MF(30, num_users, num_items)
net.initialize(ctx=devices, force_reinit=True, init=mx.init.Normal(0.01))
lr, num_epochs, wd, optimizer = 0.002, 20, 1e-5, 'adam'
loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), optimizer,
                        {"learning_rate": lr, 'wd': wd})
train_recsys_rating(net, train_iter, test_iter, loss, trainer, num_epochs,
                    devices, evaluator)


train loss 0.064, test RMSE 1.051
66287.6 examples/sec on [gpu(0), gpu(1)]


Below, we use the trained model to predict the rating that a user (ID 20) might give to an item (ID
30).


scores = net(np.array([20], dtype='int', ctx=devices[0]),
             np.array([30], dtype='int', ctx=devices[0]))
scores


array([2.9934], ctx=gpu(0))


Summary 


• The matrix factorization model is widely used in recommender systems. It can be used to
predict ratings that a user might give to an item.


• We can implement and train matrix factorization for recommender systems.


Exercises


• Vary the size of latent factors. How does the size of latent factors influence the model
performance?


• Try different optimizers, learning rates, and weight decay rates.


• Check the predicted rating scores of other users for a specific movie.


Discussions [225]


16.4 AutoRec: Rating Prediction with Autoencoders 


Although the matrix factorization model achieves decent performance on the rating prediction
task, it is essentially a linear model. Thus, such models are not capable of capturing complex
nonlinear and intricate relationships that may be predictive of users' preferences. In this section,
we introduce a nonlinear neural network collaborative filtering model, AutoRec (Sedhain et al.,
2015). It casts collaborative filtering (CF) as an autoencoder architecture and aims to integrate
nonlinear transformations into CF on the basis of explicit feedback. Neural networks have
been proven to be capable of approximating any continuous function, making them suitable for
addressing the limitation of matrix factorization and enriching its expressiveness.


On one hand, AutoRec has the same structure as an autoencoder, which consists of an input layer,
a hidden layer, and a reconstruction (output) layer. An autoencoder is a neural network that
learns to copy its input to its output in order to encode the inputs into hidden (and usually
low-dimensional) representations. Instead of explicitly embedding users/items into a
low-dimensional space, AutoRec uses a column/row of the interaction matrix as the input and then
reconstructs the interaction matrix in the output layer.


On the other hand, AutoRec differs from a traditional autoencoder: rather than learning the hid- 
den representations, AutoRec focuses on learning/reconstructing the output layer. It uses a par- 
tially observed interaction matrix as the input, aiming to reconstruct a completed rating matrix. 
In the meantime, the missing entries of the input are filled in the output layer via reconstruction 
for the purpose of recommendation. 


There are two variants of AutoRec: user-based and item-based. For brevity, here we only introduce 
the item-based AutoRec. User-based AutoRec can be derived accordingly. 


16.4.1 Model 


Let $\mathbf{R}_{*i}$ denote the $i^\mathrm{th}$ column of the rating matrix, where unknown ratings
are set to zeros by default. The neural architecture is defined as:


$$h(\mathbf{R}_{*i}) = f(\mathbf{W} \cdot g(\mathbf{V} \mathbf{R}_{*i} + \mu) + b) \qquad (16.4.1)$$


where $f(\cdot)$ and $g(\cdot)$ represent activation functions, $\mathbf{W}$ and $\mathbf{V}$ are
weight matrices, and $\mu$ and $b$ are biases. Let $h(\cdot)$ denote the whole network of AutoRec.
The output $h(\mathbf{R}_{*i})$ is the reconstruction of the $i^\mathrm{th}$ column of the rating
matrix.


The following objective function aims to minimize the reconstruction error:


$$\underset{\mathbf{W}, \mathbf{V}, \mu, b}{\mathrm{argmin}} \sum_{i=1}^M \| \mathbf{R}_{*i} - h(\mathbf{R}_{*i}) \|_{\mathcal{O}}^2 + \lambda (\| \mathbf{W} \|_F^2 + \| \mathbf{V} \|_F^2) \qquad (16.4.2)$$


where $\| \cdot \|_{\mathcal{O}}$ means that only the contribution of observed ratings is
considered, that is, only weights that are associated with observed inputs are updated during
back-propagation.
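

The two formulas above can be traced on a tiny example. The following sketch uses standard NumPy with a made-up 4 x 3 rating matrix, a sigmoid for $g$, and the identity for $f$; the function names, the shapes, and the random initialization are assumptions for illustration and do not replace the Gluon model defined later.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autorec_forward(r, V, mu, W, b):
    """h(r) = f(W g(V r + mu) + b) with g = sigmoid and f = identity."""
    hidden = sigmoid(V @ r + mu)
    return W @ hidden + b

def masked_objective(R, V, mu, W, b, lam):
    """Squared error on observed (nonzero) entries plus the Frobenius penalty."""
    total = 0.0
    for i in range(R.shape[1]):      # loop over the item columns R_{*i}
        r = R[:, i]
        mask = r != 0                # observed ratings only
        diff = (r - autorec_forward(r, V, mu, W, b)) * mask
        total += diff @ diff
    return total + lam * (np.sum(W ** 2) + np.sum(V ** 2))

rng = np.random.default_rng(0)
num_users, num_items, num_hidden = 4, 3, 2
# Hypothetical ratings: rows are users, columns are items, 0 means unknown.
R = np.array([[5, 0, 1],
              [4, 0, 0],
              [0, 3, 5],
              [1, 4, 0]], dtype=float)
V = rng.normal(0, 0.1, (num_hidden, num_users))
mu = np.zeros(num_hidden)
W = rng.normal(0, 0.1, (num_users, num_hidden))
b = np.zeros(num_users)

recon = autorec_forward(R[:, 0], V, mu, W, b)   # reconstruction of item 0's column
objective = masked_objective(R, V, mu, W, b, lam=0.01)
print(recon.shape, objective)
```

The reconstruction has one entry per user, including users who never rated the item, which is what makes it usable for recommendation, while the objective only looks at the entries that were actually observed.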


from d2l import mxnet as d2l
from mxnet import autograd, gluon, np, npx
from mxnet.gluon import nn
import mxnet as mx

npx.set_np()


16.4.2 Implementing the Model 


A typical autoencoder consists of an encoder and a decoder. The encoder projects the input to
hidden representations and the decoder maps the hidden layer to the reconstruction layer. We
follow this practice and create the encoder and decoder with dense layers. The activation of the
encoder is set to sigmoid by default, and no activation is applied to the decoder. Dropout is
included after the encoding transformation to reduce over-fitting. The gradients of unobserved
inputs are masked out to ensure that only observed ratings contribute to the model learning process.


class AutoRec(nn.Block):
    def __init__(self, num_hidden, num_users, dropout=0.05):
        super(AutoRec, self).__init__()
        self.encoder = nn.Dense(num_hidden, activation='sigmoid',
                                use_bias=True)
        self.decoder = nn.Dense(num_users, use_bias=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input):
        hidden = self.dropout(self.encoder(input))
        pred = self.decoder(hidden)
        if autograd.is_training():  # Mask the gradient during training
            return pred * np.sign(input)
        else:
            return pred


16.4.3 Reimplementing the Evaluator 


Since the input and output have been changed, we need to reimplement the evaluation function, 
while we still use RMSE as the accuracy measure. 


def evaluator(network, inter_matrix, test_data, devices):
    scores = []
    for values in inter_matrix:
        feat = gluon.utils.split_and_load(values, devices, even_split=False)
        scores.extend([network(i).asnumpy() for i in feat])
    recons = np.array([item for sublist in scores for item in sublist])
    # Calculate the test RMSE
    rmse = np.sqrt(np.sum(np.square(test_data - np.sign(test_data) * recons))
                   / np.sum(np.sign(test_data)))
    return float(rmse)
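

The masking trick in the last lines of the evaluator can be checked on a small example. The sketch below uses standard NumPy and made-up matrices (both are assumptions for illustration): because ratings are positive, `np.sign` yields 1 exactly where a test rating exists and 0 elsewhere, so reconstructed values for unrated pairs drop out of the error.

```python
import numpy as np

# Hypothetical test ratings; 0 marks user-item pairs without a test rating.
test_data = np.array([[0., 4., 0.],
                      [5., 0., 2.]])
# Hypothetical model reconstruction of every entry.
recons = np.array([[3.2, 3.0, 1.1],
                   [4.0, 2.7, 2.0]])

mask = np.sign(test_data)    # 1 where a test rating exists, else 0
squared_error = np.sum(np.square(test_data - mask * recons))
rmse = np.sqrt(squared_error / np.sum(mask))
# Errors on the three rated entries are 1, 1 and 0, so RMSE = sqrt(2 / 3).
print(rmse)
```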


16.4.4 Training and Evaluating the Model 


Now, let us train and evaluate AutoRec on the MovieLens dataset. In the run below, the test RMSE is
lower than that of the matrix factorization model, which suggests that neural networks are
effective in the rating prediction task.


devices = d2l.try_all_gpus()
# Load the MovieLens 100K dataset
df, num_users, num_items = d2l.read_data_ml100k()
train_data, test_data = d2l.split_data_ml100k(df, num_users, num_items)
_, _, _, train_inter_mat = d2l.load_data_ml100k(train_data, num_users,
                                                num_items)
_, _, _, test_inter_mat = d2l.load_data_ml100k(test_data, num_users,
                                               num_items)
train_iter = gluon.data.DataLoader(train_inter_mat, shuffle=True,
                                   last_batch="rollover", batch_size=256,
                                   num_workers=d2l.get_dataloader_workers())
test_iter = gluon.data.DataLoader(np.array(train_inter_mat), shuffle=False,
                                  last_batch="keep", batch_size=1024,
                                  num_workers=d2l.get_dataloader_workers())
# Model initialization, training, and evaluation
net = AutoRec(500, num_users)
net.initialize(ctx=devices, force_reinit=True, init=mx.init.Normal(0.01))
lr, num_epochs, wd, optimizer = 0.002, 25, 1e-5, 'adam'
loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), optimizer,
                        {"learning_rate": lr, 'wd': wd})
d2l.train_recsys_rating(net, train_iter, test_iter, loss, trainer, num_epochs,
                        devices, evaluator, inter_mat=test_inter_mat)





train loss 0.000, test RMSE 0.897 
34730445.1 examples/sec on [gpu(0), gpu(1)]





Summary 


• We can frame the matrix factorization algorithm with autoencoders, while integrating
nonlinear layers and dropout regularization.


• Experiments on the MovieLens 100K dataset show that AutoRec achieves better performance
than matrix factorization.


Exercises


• Vary the hidden dimension of AutoRec to see its impact on the model performance.


• Try to add more hidden layers. Is it helpful to improve the model performance?


• Can you find a better combination of decoder and encoder activation functions?


Discussions [226]


16.5 Personalized Ranking for Recommender Systems 


In the previous sections, only explicit feedback was considered, and models were trained and tested
on observed ratings. There are two drawbacks to such methods. First, most feedback in real-world
scenarios is implicit rather than explicit, and explicit feedback can be more expensive to collect.
Second, non-observed user-item pairs, which may be predictive of users' interests, are totally
ignored, making these methods unsuitable for cases where ratings are not missing at random but
because of users' preferences. Non-observed user-item pairs are a mixture of real negative feedback
(users are not interested in the items) and missing values (the user might interact with the items
in the future). We simply ignore the non-observed pairs in matrix factorization and AutoRec.
Clearly, these models are incapable of distinguishing between observed and non-observed pairs and
are usually not suitable for personalized ranking tasks.


To this end, a class of recommendation models that aim to generate ranked recommendation
lists from implicit feedback has gained popularity. In general, personalized ranking models can
be optimized with pointwise, pairwise, or listwise approaches. Pointwise approaches consider
a single interaction at a time and train a classifier or a regressor to predict individual
preferences. Matrix factorization and AutoRec are optimized with pointwise objectives. Pairwise
approaches consider a pair of items for each user and aim to approximate the optimal ordering for
that pair. Usually, pairwise approaches are more suitable for the ranking task because predicting
relative order matches the nature of ranking. Listwise approaches approximate the
ordering of the entire list of items, for example, by directly optimizing ranking measures such
as Normalized Discounted Cumulative Gain (NDCG [227]). However, listwise approaches are more
complex and compute-intensive than pointwise or pairwise approaches. In this section, we will
introduce two pairwise objectives/losses, Bayesian Personalized Ranking loss and Hinge loss, and
their respective implementations.
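

Since NDCG is mentioned as a listwise ranking measure, here is a minimal sketch of one common definition, in which the gain of the item at rank $r$ (counting from 1) is discounted by $\log_2(r + 1)$ and the result is normalized by the score of the ideal ordering. The function names and the binary relevance lists are assumptions for illustration; other variants of the gain function exist.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance values."""
    # enumerate starts at 0, so rank + 2 gives log2(2), log2(3), ... for ranks 1, 2, ...
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance of the items in the order the model ranked them (1 = relevant).
print(ndcg([1, 1, 0]))   # ideal ordering
print(ndcg([1, 0, 1]))   # one relevant item ranked too low
```

The score depends on the whole list at once, which is why optimizing it directly is harder than optimizing a pointwise or pairwise loss.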


16.5.1 Bayesian Personalized Ranking Loss and its Implementation 


Bayesian personalized ranking (BPR) (Rendle et al., 2009) is a pairwise personalized ranking loss 
that is derived from the maximum posterior estimator. It has been widely used in many existing 
recommendation models. The training data of BPR consists of both positive and negative pairs 
(missing values). It assumes that the user prefers the positive item over all other non-observed 
items. 


Formally, the training data consists of tuples of the form (u, i, j), each of which indicates
that the user u prefers the item i over the item j. The Bayesian formulation of BPR, which aims
to maximize the posterior probability, is given below:


$$p(\Theta \mid >_u) \propto p(>_u \mid \Theta)\, p(\Theta) \tag{16.5.1}$$


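where $\Theta$ represents the parameters of an arbitrary recommendation model and $>_u$ represents
the desired personalized total ranking of all items for user $u$. We can formulate the maximum
posterior estimator to derive the generic optimization criterion for the personalized ranking task.

$$
\begin{aligned}
\text{BPR-OPT} &:= \ln p(\Theta \mid >_u) \\
&\propto \ln p(>_u \mid \Theta)\, p(\Theta) \\
&= \ln \prod_{(u, i, j) \in D} \sigma(\hat{y}_{ui} - \hat{y}_{uj})\, p(\Theta) \\
&= \sum_{(u, i, j) \in D} \ln \sigma(\hat{y}_{ui} - \hat{y}_{uj}) + \ln p(\Theta) \\
&= \sum_{(u, i, j) \in D} \ln \sigma(\hat{y}_{ui} - \hat{y}_{uj}) - \lambda_\Theta \|\Theta\|^2
\end{aligned}
\tag{16.5.2}
$$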


where $D := \{(u, i, j) \mid i \in I^+_u \wedge j \in I \backslash I^+_u\}$ is the training set,
with $I^+_u$ denoting the items the user $u$ liked, $I$ denoting all items, and
$I \backslash I^+_u$ indicating all other items excluding the items the user liked.
$\hat{y}_{ui}$ and $\hat{y}_{uj}$ are the predicted scores of the user $u$ for items $i$ and $j$,
respectively. The prior $p(\Theta)$ is a normal distribution with zero mean and
variance-covariance matrix $\Sigma_\Theta$. Here, we let $\Sigma_\Theta = \lambda_\Theta I$.

We will subclass mxnet.gluon.loss.Loss and override the forward method to construct the Bayesian
personalized ranking loss. We begin by importing the Loss class and the np module.







from mxnet import gluon, np, npx 
npx.set_np() 


The implementation of BPR loss is as follows. 


#@save
class BPRLoss(gluon.loss.Loss):
    def __init__(self, weight=None, batch_axis=0, **kwargs):
        # Pass the constructor arguments through instead of hard-coding them
        super(BPRLoss, self).__init__(weight=weight, batch_axis=batch_axis,
                                      **kwargs)

    def forward(self, positive, negative):
        # Score gap between the preferred item and the sampled negative item
        distances = positive - negative
        loss = - np.sum(np.log(npx.sigmoid(distances)), 0, keepdims=True)
        return loss
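
To get a feel for the numbers, here is a minimal sketch of the same summation in plain Python (standard library only, independent of MXNet). The helper name `bpr_loss` and the example scores are illustrative assumptions, not part of the d2l package.

```python
import math

def sigmoid(x):
    # Logistic function used inside the BPR objective
    return 1.0 / (1.0 + math.exp(-x))

def bpr_loss(pos_scores, neg_scores):
    # Negative log-likelihood of ranking each positive item above its negative
    return -sum(math.log(sigmoid(p - n))
                for p, n in zip(pos_scores, neg_scores))

# Hypothetical predicted scores for three (user, positive, negative) triples
well_ranked = bpr_loss([2.0, 1.5, 3.0], [0.0, 0.5, 1.0])
badly_ranked = bpr_loss([0.0, 0.5, 1.0], [2.0, 1.5, 3.0])
print(well_ranked, badly_ranked)
```

The loss is small when positive items already score higher than their negatives and grows when the order is reversed. With equal scores each pair contributes ln 2.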


16.5.2 Hinge Loss and its Implementation 


The hinge loss for ranking has a different form from the hinge loss provided within the gluon
library that is often used in classifiers such as SVMs. The loss used for ranking in recommender
systems has the following form.


$$\sum_{(u, i, j) \in D} \max(m - \hat{y}_{ui} + \hat{y}_{uj}, 0) \tag{16.5.3}$$


where $m$ is the safety margin size. It aims to push negative items away from positive items.
Similar to BPR, it optimizes the relative distance between positive and negative samples instead
of absolute outputs, making it well suited to recommender systems.


#@save
class HingeLossbRec(gluon.loss.Loss):
    def __init__(self, weight=None, batch_axis=0, **kwargs):
        # Pass the constructor arguments through instead of hard-coding them
        super(HingeLossbRec, self).__init__(weight=weight,
                                            batch_axis=batch_axis, **kwargs)

    def forward(self, positive, negative, margin=1):
        distances = positive - negative
        # Only pairs whose score gap is below the margin are penalized
        loss = np.sum(np.maximum(- distances + margin, 0))
        return loss
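
As a small worked example, the following plain-Python sketch evaluates the ranking hinge loss on three pairs. The function name `hinge_rank_loss` and the scores are hypothetical. It only mirrors the formula in (16.5.3) and does not use MXNet.

```python
def hinge_rank_loss(pos_scores, neg_scores, margin=1.0):
    # A pair contributes only when (positive - negative) is below the margin
    return sum(max(margin - (p - n), 0.0)
               for p, n in zip(pos_scores, neg_scores))

# Hypothetical scores whose gaps are 2.0, 0.5 and -1.0
loss = hinge_rank_loss([3.0, 1.5, 0.0], [1.0, 1.0, 1.0])
print(loss)  # 0.0 + 0.5 + 2.0 = 2.5
```

The first pair already clears the margin and contributes nothing, which is the main behavioral difference from BPR, where every pair contributes a nonzero term.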


These two losses are interchangeable for personalized ranking in recommendation. 


Summary 


• There are three types of ranking losses available for the personalized ranking task in
recommender systems, namely, pointwise, pairwise, and listwise methods.


• The two pairwise losses, Bayesian personalized ranking loss and hinge loss, can be used
interchangeably.


Exercises 


• Are there any variants of BPR and hinge loss available?


• Can you find any recommendation models that use BPR or hinge loss?


Discussions


16.6 Neural Collaborative Filtering for Personalized Ranking 


This section moves beyond explicit feedback, introducing the neural collaborative filtering (NCF)
framework for recommendation with implicit feedback. Implicit feedback is pervasive in
recommender systems. Actions such as clicks, purchases, and watches are common forms of implicit
feedback that are easy to collect and indicative of users' preferences. The model we will
introduce, titled NeuMF (He et al., 2017b), short for neural matrix factorization, aims to address
the personalized ranking task with implicit feedback. This model leverages the flexibility and
non-linearity of neural networks to replace the dot products of matrix factorization, aiming to
enhance the model expressiveness. Specifically, this model is structured with two subnetworks,
generalized matrix factorization (GMF) and an MLP, and models the interactions through two
pathways instead of simple inner products. The outputs of these two networks are concatenated to
calculate the final prediction scores. Unlike the rating prediction task in AutoRec, this model
generates a ranked recommendation list for each user based on the implicit feedback. We will use
the personalized ranking loss introduced in the last section to train this model.


16.6.1 The NeuMF model


As mentioned above, NeuMF fuses two subnetworks. The GMF is a generic neural network version
of matrix factorization where the input is the elementwise product of user and item latent factors.
It consists of two neural layers:


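$$
\begin{aligned}
\mathbf{x} &= \mathbf{p}_u \odot \mathbf{q}_i \\
\hat{y}_{ui} &= \alpha(\mathbf{h}^\top \mathbf{x}),
\end{aligned}
\tag{16.6.1}
$$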





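where $\odot$ denotes the Hadamard product of vectors. $\mathbf{P} \in \mathbb{R}^{m \times k}$
and $\mathbf{Q} \in \mathbb{R}^{n \times k}$ correspond to the user and item latent matrices,
respectively. $\mathbf{p}_u \in \mathbb{R}^k$ is the $u^\mathrm{th}$ row of $\mathbf{P}$ and
$\mathbf{q}_i \in \mathbb{R}^k$ is the $i^\mathrm{th}$ row of $\mathbf{Q}$. $\alpha$ and
$\mathbf{h}$ denote the activation function and weight of the output layer. $\hat{y}_{ui}$ is the
prediction score that the user $u$ might give to the item $i$.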











Another component of this model is MLP. To enrich model flexibility, the MLP subnetwork does 
not share user and item embeddings with GMF. It uses the concatenation of user and item embed- 
dings as input. With the complicated connections and nonlinear transformations, it is capable of 
estimating the intricate interactions between users and items. More precisely, the MLP subnet- 
work is defined as: 

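$$
\begin{aligned}
z^{(1)} &= \phi_1(\mathbf{U}_u, \mathbf{V}_i) = [\mathbf{U}_u, \mathbf{V}_i] \\
\phi^{(2)}(z^{(1)}) &= \alpha^1(\mathbf{W}^{(2)} z^{(1)} + b^{(2)}) \\
&\ldots \\
\phi^{(L)}(z^{(L-1)}) &= \alpha^L(\mathbf{W}^{(L)} z^{(L-1)} + b^{(L)}) \\
\hat{y}_{ui} &= \alpha(\mathbf{h}^\top \phi^L(z^{(L-1)}))
\end{aligned}
\tag{16.6.2}
$$

where $\mathbf{W}^*$, $\mathbf{b}^*$ and $\alpha^*$ denote the weight matrix, bias vector, and
activation function. $\phi^*$ denotes the function of the corresponding layer. $\mathbf{z}^*$
denotes the output of the corresponding layer.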


To fuse the results of GMF and MLP, instead of simple addition, NeuMF concatenates the
second-to-last layers of the two subnetworks to create a feature vector which can be passed to
further layers. Afterwards, the outputs are projected with matrix $\mathbf{h}$ and a sigmoid
activation function. The prediction layer is formulated as:


$$\hat{y}_{ui} = \sigma(\mathbf{h}^\top [\mathbf{x}, \phi^L(z^{(L-1)})]). \tag{16.6.3}$$


The following figure illustrates the model architecture of NeuMF.

Fig. 16.6.1: Illustration of the NeuMF model


from d2l import mxnet as d2l
from mxnet import autograd, gluon, np, npx
from mxnet.gluon import nn
import mxnet as mx
import random

npx.set_np()


16.6.2 Model Implementation 


The following code implements the NeuMF model. It consists of a generalized matrix factoriza- 
tion model and a multi-layered perceptron with different user and item embedding vectors. The 
structure of the MLP is controlled with the parameter nums_hiddens. ReLU is used as the default 
activation function. 


class NeuMF(nn.Block):
    def __init__(self, num_factors, num_users, num_items, nums_hiddens,
                 **kwargs):
        super(NeuMF, self).__init__(**kwargs)
        # P, Q: embeddings for the GMF path; U, V: embeddings for the MLP path
        self.P = nn.Embedding(num_users, num_factors)
        self.Q = nn.Embedding(num_items, num_factors)
        self.U = nn.Embedding(num_users, num_factors)
        self.V = nn.Embedding(num_items, num_factors)
        self.mlp = nn.Sequential()
        for num_hiddens in nums_hiddens:
            self.mlp.add(nn.Dense(num_hiddens, activation='relu',
                                  use_bias=True))
        self.prediction_layer = nn.Dense(1, activation='sigmoid', use_bias=False)

    def forward(self, user_id, item_id):
        p_mf = self.P(user_id)
        q_mf = self.Q(item_id)
        gmf = p_mf * q_mf
        p_mlp = self.U(user_id)
        q_mlp = self.V(item_id)
        mlp = self.mlp(np.concatenate([p_mlp, q_mlp], axis=1))
        con_res = np.concatenate([gmf, mlp], axis=1)
        return self.prediction_layer(con_res)
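
To make the two pathways concrete, the sketch below traces a single (user, item) pair through a toy NeuMF forward pass in plain Python. All names (`neumf_score`, `dense`) and all embedding and weight values are assumptions chosen for illustration. The MLP path here has a single hidden layer, and no MXNet is involved.

```python
import math

def relu(v):
    return [max(x, 0.0) for x in v]

def dense(v, weights, bias):
    # Each row of `weights` produces one output unit
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def neumf_score(p_u, q_i, u_u, v_i, W, b, h):
    gmf = [p * q for p, q in zip(p_u, q_i)]   # Hadamard product (GMF path)
    mlp = relu(dense(u_u + v_i, W, b))        # concatenated embeddings (MLP path)
    fused = gmf + mlp                         # concatenate both pathways
    logit = sum(w * x for w, x in zip(h, fused))
    return 1.0 / (1.0 + math.exp(-logit))     # sigmoid prediction layer

# Hypothetical 2-dimensional embeddings and weights
p_u, q_i = [1.0, 2.0], [0.5, -1.0]
u_u, v_i = [1.0, 0.0], [0.0, 1.0]
W = [[1.0, 0.0, 0.0, 1.0], [-1.0, 0.0, 0.0, -1.0]]
b = [0.0, 0.0]
h = [1.0, 1.0, 1.0, 1.0]
score = neumf_score(p_u, q_i, u_u, v_i, W, b, h)
print(score)
```

Here the GMF path yields [0.5, -2.0], the MLP path yields [2.0, 0.0], and the fused logit is 0.5, so the score is sigmoid(0.5).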


16.6.3 Customized Dataset with Negative Sampling 


For pairwise ranking loss, an important step is negative sampling. For each user, the items that
the user has not interacted with are candidate items (unobserved entries). The following dataset
class takes user identities and candidate items as input, and samples negative items randomly for
each user from the candidate set of that user. During the training stage, the model learns to
rank the items that a user likes higher than the items the user dislikes or has not interacted
with.


class PRDataset(gluon.data.Dataset):
    def __init__(self, users, items, candidates, num_items):
        self.users = users
        self.items = items
        self.cand = candidates
        self.all = set([i for i in range(num_items)])

    def __len__(self):
        return len(self.users)

    def __getitem__(self, idx):
        # Items the user never interacted with are the negative candidates
        neg_items = list(self.all - set(self.cand[int(self.users[idx])]))
        indices = random.randint(0, len(neg_items) - 1)
        return self.users[idx], self.items[idx], neg_items[indices]
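
The sampling step on its own can be sketched in plain Python as follows. The function `sample_negative`, the toy interaction dictionary, and the seeded generator are assumptions for illustration. The dataset class above uses the module-level `random.randint` instead.

```python
import random

def sample_negative(user, interacted, num_items, rng):
    # Candidate negatives: every item the user has not interacted with
    negatives = sorted(set(range(num_items)) - set(interacted[user]))
    return negatives[rng.randint(0, len(negatives) - 1)]

# Hypothetical history: user 0 interacted with items 1 and 3 out of 5 items
interacted = {0: [1, 3]}
rng = random.Random(0)
samples = [sample_negative(0, interacted, 5, rng) for _ in range(10)]
print(samples)
```

Every draw comes from the unobserved items {0, 2, 4}. Rebuilding the candidate list on each call is simple but costs time linear in the number of items, which is acceptable for a dataset of this size.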


16.6.4 Evaluator 


In this section, we adopt the splitting-by-time strategy to construct the training and test sets.
Two evaluation measures, hit rate at a given cutoff $\ell$ (Hit@$\ell$) and area under the ROC
curve (AUC), are used to assess the model effectiveness. Hit rate at a given position $\ell$ for
each user indicates whether the recommended item is included in the top $\ell$ ranked list. The
formal definition is as follows:


$$\text{Hit}@\ell = \frac{1}{m} \sum_{u \in \mathcal{U}} \mathbf{1}(rank_{u, g_u} \leq \ell), \tag{16.6.4}$$


where $\mathbf{1}$ denotes an indicator function that is equal to one if the ground truth item is
ranked in the top $\ell$ list, otherwise it is equal to zero. $rank_{u, g_u}$ denotes the ranking
of the ground truth item $g_u$ of the user $u$ in the recommendation list (the ideal ranking is
1). $m$ is the number of users. $\mathcal{U}$ is the user set.


The definition of AUC is as follows:


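$$\text{AUC} = \frac{1}{m} \sum_{u \in \mathcal{U}} \frac{1}{|\mathcal{I} \backslash S_u|} \sum_{j \in \mathcal{I} \backslash S_u} \mathbf{1}(rank_{u, g_u} < rank_{u, j}), \tag{16.6.5}$$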


where $\mathcal{I}$ is the item set. $S_u$ is the set of candidate items of user $u$. Note that
many other evaluation protocols such as precision, recall, and normalized discounted cumulative
gain (NDCG) can also be used.


The following function calculates the hit counts and AUC for each user. 


#@save
def hit_and_auc(rankedlist, test_matrix, k):
    hits_k = [(idx, val) for idx, val in enumerate(rankedlist[:k])
              if val in set(test_matrix)]
    hits_all = [(idx, val) for idx, val in enumerate(rankedlist)
                if val in set(test_matrix)]
    max = len(rankedlist) - 1
    auc = 1.0 * (max - hits_all[0][0]) / max if len(hits_all) > 0 else 0
    return len(hits_k), auc
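
A short usage sketch may help in reading the return values. The function is restated below in plain Python so the example is self-contained (with `max` renamed to `max_rank` to avoid shadowing the built-in). The ranked list and held-out item are hypothetical.

```python
def hit_and_auc(rankedlist, test_items, k):
    test_set = set(test_items)
    hits_k = [(idx, val) for idx, val in enumerate(rankedlist[:k])
              if val in test_set]
    hits_all = [(idx, val) for idx, val in enumerate(rankedlist)
                if val in test_set]
    max_rank = len(rankedlist) - 1
    # Fraction of the other items that are ranked below the first hit
    auc = 1.0 * (max_rank - hits_all[0][0]) / max_rank if hits_all else 0
    return len(hits_k), auc

ranked = [7, 3, 9, 1, 4]            # hypothetical ranking, best item first
hits_top2, auc_top2 = hit_and_auc(ranked, [9], k=2)
hits_top3, auc_top3 = hit_and_auc(ranked, [9], k=3)
print(hits_top2, auc_top2)  # 0 0.5
print(hits_top3, auc_top3)  # 1 0.5
```

The held-out item 9 sits at index 2, so it counts as a hit for k = 3 but not for k = 2. Two of the four other items rank below it, giving an AUC of 0.5 regardless of k.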


Then, the overall Hit rate and AUC are calculated as follows. 


#@save
def evaluate_ranking(net, test_input, seq, candidates, num_users, num_items,
                     devices):
    ranked_list, ranked_items, hit_rate, auc = {}, {}, [], []
    # Rank over every item id (0 .. num_items - 1)
    all_items = set([i for i in range(num_items)])
    for u in range(num_users):
        neg_items = list(all_items - set(candidates[int(u)]))
        user_ids, item_ids, x, scores = [], [], [], []
        [item_ids.append(i) for i in neg_items]
        [user_ids.append(u) for _ in neg_items]
        x.extend([np.array(user_ids)])
        if seq is not None:
            x.append(seq[user_ids, :])
        x.extend([np.array(item_ids)])
        test_data_iter = gluon.data.DataLoader(
            gluon.data.ArrayDataset(*x), shuffle=False, last_batch="keep",
            batch_size=1024)
        for index, values in enumerate(test_data_iter):
            x = [gluon.utils.split_and_load(v, devices, even_split=False)
                 for v in values]
            scores.extend([list(net(*t).asnumpy()) for t in zip(*x)])
        scores = [item for sublist in scores for item in sublist]
        item_scores = list(zip(item_ids, scores))
        ranked_list[u] = sorted(item_scores, key=lambda t: t[1], reverse=True)
        ranked_items[u] = [r[0] for r in ranked_list[u]]
        temp = hit_and_auc(ranked_items[u], test_input[u], 50)
        hit_rate.append(temp[0])
        auc.append(temp[1])
    return np.mean(np.array(hit_rate)), np.mean(np.array(auc))


16.6.5 Training and Evaluating the Model


The training function is defined below. We train the model in the pairwise manner. 


#@save
def train_ranking(net, train_iter, test_iter, loss, trainer, test_seq_iter,
                  num_users, num_items, num_epochs, devices, evaluator,
                  candidates, eval_step=1):
    timer, hit_rate, auc = d2l.Timer(), 0, 0
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0, 1],
                            legend=['test hit rate', 'test AUC'])
    for epoch in range(num_epochs):
        metric, l = d2l.Accumulator(3), 0.
        for i, values in enumerate(train_iter):
            input_data = []
            for v in values:
                input_data.append(gluon.utils.split_and_load(v, devices))
            with autograd.record():
                # The last input is the negative item, the one before it the
                # positive item
                p_pos = [net(*t) for t in zip(*input_data[0:-1])]
                p_neg = [net(*t) for t in zip(*input_data[0:-2],
                                              input_data[-1])]
                ls = [loss(p, n) for p, n in zip(p_pos, p_neg)]
            [l.backward(retain_graph=False) for l in ls]
            l += sum([l.asnumpy() for l in ls]).mean()/len(devices)
            trainer.step(values[0].shape[0])
            metric.add(l, values[0].shape[0], values[0].size)
            timer.stop()
        with autograd.predict_mode():
            if (epoch + 1) % eval_step == 0:
                hit_rate, auc = evaluator(net, test_iter, test_seq_iter,
                                          candidates, num_users, num_items,
                                          devices)
                animator.add(epoch + 1, (hit_rate, auc))
    print(f'train loss {metric[0] / metric[1]:.3f}, '
          f'test hit rate {float(hit_rate):.3f}, test AUC {float(auc):.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(devices)}')


Now, we can load the MovieLens 100k dataset and train the model. Since there are only ratings in
the MovieLens dataset, we binarize these ratings to zeros and ones, at some loss of accuracy.
If a user rated an item, we consider the implicit feedback as one, otherwise as zero. The action of
rating an item can be treated as a form of providing implicit feedback. Here, we split the dataset
in the seq-aware mode, where each user's latest interacted item is held out for testing.


batch_size = 1024
df, num_users, num_items = d2l.read_data_ml100k()
train_data, test_data = d2l.split_data_ml100k(df, num_users, num_items,
                                              'seq-aware')
users_train, items_train, ratings_train, candidates = d2l.load_data_ml100k(
    train_data, num_users, num_items, feedback="implicit")
users_test, items_test, ratings_test, test_iter = d2l.load_data_ml100k(
    test_data, num_users, num_items, feedback="implicit")
train_iter = gluon.data.DataLoader(
    PRDataset(users_train, items_train, candidates, num_items), batch_size,
    True, last_batch="rollover", num_workers=d2l.get_dataloader_workers())


We then create and initialize the model. We use a three-layer MLP with constant hidden size 10.


devices = d2l.try_all_gpus()
net = NeuMF(10, num_users, num_items, nums_hiddens=[10, 10, 10])
net.initialize(ctx=devices, force_reinit=True, init=mx.init.Normal(0.01))


The following code trains the model. 


lr, num_epochs, wd, optimizer = 0.01, 10, 1e-5, 'adam'
loss = d2l.BPRLoss()
trainer = gluon.Trainer(net.collect_params(), optimizer,
                        {"learning_rate": lr, 'wd': wd})
train_ranking(net, train_iter, test_iter, loss, trainer, None, num_users,
              num_items, num_epochs, devices, evaluate_ranking, candidates)


train loss 16.982, test hit rate 0.075, test AUC 0.531 
13.3 examples/sec on [gpu(0), gpu(1)] 


[Figure: test hit rate and test AUC versus epoch]





Summary 
• Adding nonlinearity to the matrix factorization model is beneficial for improving the model
capability and effectiveness.


• NeuMF is a combination of matrix factorization and a multilayer perceptron. The multilayer
perceptron takes the concatenation of user and item embeddings as the input.


Exercises 


• Vary the size of latent factors. How does the size of latent factors impact the model
performance?


• Vary the architectures (e.g., number of layers, number of neurons of each layer) of the MLP
to check their impact on the performance.


• Try different optimizers, learning rates, and weight decay rates.


• Try to use the hinge loss defined in the last section to optimize this model.


Discussions


16.7 Sequence-Aware Recommender Systems 


In previous sections, we abstracted the recommendation task as a matrix completion problem
without considering users' short-term behaviors. In this section, we will introduce a
recommendation model that takes the sequentially-ordered user interaction logs into account. It
is a sequence-aware recommender (Quadrana et al., 2018) where the input is an ordered and often
timestamped list of past user actions. A number of recent studies have demonstrated the
usefulness of incorporating such information in modeling users' temporal behavioral patterns and
discovering their interest drift.


The model we will introduce, Caser (Tang & Wang, 2018), short for convolutional sequence
embedding recommendation model, adopts convolutional neural networks to capture the dynamic
pattern influences of users' recent activities. The main component of Caser consists of a
horizontal convolutional network and a vertical convolutional network, aiming to uncover the
union-level and point-level sequence patterns, respectively. A point-level pattern indicates the
impact of a single item in the historical sequence on the target item, while a union-level pattern
implies the influence of several previous actions on the subsequent target. For example, buying
both milk and butter together leads to a higher probability of buying flour than just buying one
of them. Moreover, users' general interests, or long-term preferences, are also modeled in the
last fully-connected layers, resulting in a more comprehensive modeling of user interests. Details
of the model are described as follows.


16.7.1 Model Architectures 


In a sequence-aware recommender system, each user is associated with a sequence of items from
the item set. Let $S^u = (S_1^u, \ldots, S_{|S^u|}^u)$ denote the ordered sequence. The goal of
Caser is to recommend items by considering a user's general tastes as well as short-term
intention. Suppose we take the previous $L$ items into consideration. An embedding matrix that
represents the former interactions for time step $t$ can then be constructed:


$$\mathbf{E}^{(u, t)} = [\mathbf{q}_{S^u_{t-L}}, \ldots, \mathbf{q}_{S^u_{t-2}}, \mathbf{q}_{S^u_{t-1}}]^\top, \tag{16.7.1}$$


where $\mathbf{Q} \in \mathbb{R}^{n \times k}$ represents item embeddings and $\mathbf{q}_i$
denotes the $i^\mathrm{th}$ row. $\mathbf{E}^{(u, t)} \in \mathbb{R}^{L \times k}$ can be used to
infer the transient interest of user $u$ at time step $t$. We can view the input matrix
$\mathbf{E}^{(u, t)}$ as an image which is the input of the subsequent two convolutional
components.


The horizontal convolutional layer has $d$ horizontal filters
$\mathbf{F}^j \in \mathbb{R}^{h \times k}, 1 \leq j \leq d, h = \{1, \ldots, L\}$, and the
vertical convolutional layer has $d'$ vertical filters
$\mathbf{G}^j \in \mathbb{R}^{L \times 1}, 1 \leq j \leq d'$. After a series of convolutional and
pooling operations, we get the two outputs:


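$$
\begin{aligned}
\mathbf{o} &= \text{HConv}(\mathbf{E}^{(u, t)}, \mathbf{F}) \\
\mathbf{o}' &= \text{VConv}(\mathbf{E}^{(u, t)}, \mathbf{G}),
\end{aligned}
\tag{16.7.2}
$$


where $\mathbf{o} \in \mathbb{R}^d$ is the output of the horizontal convolutional network and
$\mathbf{o}' \in \mathbb{R}^{kd'}$ is the output of the vertical convolutional network. For
simplicity, we omit the details of the convolution and pooling operations.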
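
As a rough illustration of what a single filter of each kind computes, consider the plain-Python sketch below. It assumes one horizontal filter of height 2 followed by ReLU and max pooling over time, and one vertical filter without activation. The function names and all numbers are hypothetical, and this is a simplification rather than the exact layer configuration of Caser.

```python
def horizontal_conv(E, F):
    # F has shape (h, k): slide it down the L rows of E, apply ReLU,
    # then max-pool over all positions to obtain one scalar per filter
    h, k = len(F), len(E[0])
    responses = []
    for start in range(len(E) - h + 1):
        s = sum(F[r][c] * E[start + r][c] for r in range(h) for c in range(k))
        responses.append(max(s, 0.0))
    return max(responses)

def vertical_conv(E, g):
    # g has shape (L, 1): a weighted sum of the L rows, one value per column
    return [sum(g[r] * E[r][c] for r in range(len(E)))
            for c in range(len(E[0]))]

# Hypothetical embedding matrix with L = 3 past items and k = 2 factors
E = [[1.0, 0.0],
     [0.0, 2.0],
     [3.0, 1.0]]
F = [[1.0, 1.0],
     [1.0, 1.0]]        # one horizontal filter of height h = 2
g = [0.5, 0.5, 0.0]     # one vertical filter
o = horizontal_conv(E, F)
o_prime = vertical_conv(E, g)
print(o, o_prime)  # 6.0 [0.5, 1.0]
```

Each horizontal filter contributes one scalar, while each vertical filter contributes $k$ values, which is why $d'$ vertical filters produce an output of size $kd'$.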





Discussions (Section 16.6): https://discuss.d2l.ai/t/403

The two outputs $\mathbf{o}$ and $\mathbf{o}'$ are concatenated and fed into a fully-connected
neural network layer to get higher-level representations.


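$$\mathbf{z} = \phi(\mathbf{W}[\mathbf{o}, \mathbf{o}']^\top + \mathbf{b}), \tag{16.7.3}$$

where $\mathbf{W} \in \mathbb{R}^{k \times (d + kd')}$ is the weight matrix and
$\mathbf{b} \in \mathbb{R}^k$ is the bias. The learned vector $\mathbf{z} \in \mathbb{R}^k$ is the
representation of the user's short-term intent.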


Finally, the prediction function combines users' short-term and general tastes, and is defined
as:


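$$\hat{y}_{uit} = \mathbf{v}_i \cdot [\mathbf{z}, \mathbf{p}_u]^\top + \mathbf{b}'_i, \tag{16.7.4}$$


where $\mathbf{V} \in \mathbb{R}^{n \times 2k}$ is another item embedding matrix.
$\mathbf{b}' \in \mathbb{R}^n$ is the item-specific bias.
$\mathbf{P} \in \mathbb{R}^{m \times k}$ is the user embedding matrix for users' general tastes.
$\mathbf{p}_u \in \mathbb{R}^k$ is the $u^\mathrm{th}$ row of $\mathbf{P}$ and
$\mathbf{v}_i \in \mathbb{R}^{2k}$ is the $i^\mathrm{th}$ row of $\mathbf{V}$.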


The model can be learned with BPR or Hinge loss. The architecture of Caser is shown below: 








Fig. 16.7.1: Illustration of the Caser Model 


We first import the required libraries. 


from d2l import mxnet as d2l
from mxnet import gluon, np, npx
from mxnet.gluon import nn
import mxnet as mx
import random

npx.set_np()


16.7.2 Model Implementation


The following code implements the Caser model. It consists of a vertical convolutional layer, a
horizontal convolutional layer, and a fully-connected layer.


class Caser(nn.Block):
    def __init__(self, num_factors, num_users, num_items, L=5, d=16,
                 d_prime=4, drop_ratio=0.05, **kwargs):
        super(Caser, self).__init__(**kwargs)
        self.P = nn.Embedding(num_users, num_factors)
        self.Q = nn.Embedding(num_items, num_factors)
        self.d_prime, self.d = d_prime, d
        # Vertical convolution layer
        self.conv_v = nn.Conv2D(d_prime, (L, 1), in_channels=1)
        # Horizontal convolution layer
        h = [i + 1 for i in range(L)]
        self.conv_h, self.max_pool = nn.Sequential(), nn.Sequential()
        for i in h:
            self.conv_h.add(nn.Conv2D(d, (i, num_factors), in_channels=1))
            self.max_pool.add(nn.MaxPool1D(L - i + 1))
        # Fully-connected layer
        self.fc1_dim_v, self.fc1_dim_h = d_prime * num_factors, d * len(h)
        self.fc = nn.Dense(in_units=d_prime * num_factors + d * L,
                           activation='relu', units=num_factors)
        self.Q_prime = nn.Embedding(num_items, num_factors * 2)
        self.b = nn.Embedding(num_items, 1)
        self.dropout = nn.Dropout(drop_ratio)

    def forward(self, user_id, seq, item_id):
        item_embs = np.expand_dims(self.Q(seq), 1)
        user_emb = self.P(user_id)
        out, out_h, out_v, out_hs = None, None, None, []
        if self.d_prime:
            out_v = self.conv_v(item_embs)
            out_v = out_v.reshape(out_v.shape[0], self.fc1_dim_v)
        if self.d:
            for conv, maxp in zip(self.conv_h, self.max_pool):
                conv_out = np.squeeze(npx.relu(conv(item_embs)), axis=3)
                t = maxp(conv_out)
                pool_out = np.squeeze(t, axis=2)
                out_hs.append(pool_out)
            out_h = np.concatenate(out_hs, axis=1)
        out = np.concatenate([out_v, out_h], axis=1)
        z = self.fc(self.dropout(out))
        x = np.concatenate([z, user_emb], axis=1)
        q_prime_i = np.squeeze(self.Q_prime(item_id))
        b = np.squeeze(self.b(item_id))
        res = (x * q_prime_i).sum(1) + b
        return res


16.7.3 Sequential Dataset with Negative Sampling


To process the sequential interaction data, we need to reimplement the Dataset class. The
following code creates a new dataset class named SeqDataset. For each sample, it outputs the user
identity, the user's previous L interacted items as a sequence, and the next item the user
interacts with as the target. The following figure demonstrates the data loading process for one
user. Suppose that this user liked nine movies. We organize these nine movies in chronological
order. The latest movie is left out as the test item. From the remaining eight movies, we can get
three training samples, with each sample containing a sequence of five (L = 5) movies and its
subsequent item as the target item. Negative samples are also included in the customized dataset.




Fig. 16.7.2: Illustration of the data generation process 


class SeqDataset(gluon.data.Dataset):
    def __init__(self, user_ids, item_ids, L, num_users, num_items,
                 candidates):
        user_ids, item_ids = np.array(user_ids), np.array(item_ids)
        sort_idx = np.array(sorted(range(len(user_ids)),
                                   key=lambda k: user_ids[k]))
        u_ids, i_ids = user_ids[sort_idx], item_ids[sort_idx]
        temp, u_ids, self.cand = {}, u_ids.asnumpy(), candidates
        self.all_items = set([i for i in range(num_items)])
        [temp.setdefault(u_ids[i], []).append(i) for i, _ in enumerate(u_ids)]
        temp = sorted(temp.items(), key=lambda x: x[0])
        u_ids = np.array([i[0] for i in temp])
        idx = np.array([i[1][0] for i in temp])
        self.ns = ns = int(sum([c - L if c >= L + 1 else 1 for c
                                in np.array([len(i[1]) for i in temp])]))
        self.seq_items = np.zeros((ns, L))
        self.seq_users = np.zeros(ns, dtype='int32')
        self.seq_tgt = np.zeros((ns, 1))
        self.test_seq = np.zeros((num_users, L))
        test_users, _uid = np.empty(num_users), None
        for i, (uid, i_seq) in enumerate(self._seq(u_ids, i_ids, idx, L + 1)):
            if uid != _uid:
                self.test_seq[uid][:] = i_seq[-L:]
                test_users[uid], _uid = uid, uid
            self.seq_tgt[i][:] = i_seq[-1:]
            self.seq_items[i][:], self.seq_users[i] = i_seq[:L], uid

    def _win(self, tensor, window_size, step_size=1):
        # Yield sliding windows, starting from the most recent items
        if len(tensor) - window_size >= 0:
            for i in range(len(tensor), 0, - step_size):
                if i - window_size >= 0:
                    yield tensor[i - window_size:i]
                else:
                    break
        else:
            yield tensor

    def _seq(self, u_ids, i_ids, idx, max_len):
        for i in range(len(idx)):
            stop_idx = None if i >= len(idx) - 1 else int(idx[i + 1])
            for s in self._win(i_ids[int(idx[i]):stop_idx], max_len):
                yield (int(u_ids[i]), s)

    def __len__(self):
        return self.ns

    def __getitem__(self, idx):
        neg = list(self.all_items - set(self.cand[int(self.seq_users[idx])]))
        i = random.randint(0, len(neg) - 1)
        return (self.seq_users[idx], self.seq_items[idx], self.seq_tgt[idx],
                neg[i])
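
The windowing logic can be checked on the eight-movie example described above with a small plain-Python sketch. The helper `sliding_windows` is a hypothetical stand-in for `_win` that works on ordinary lists, and the movie ids 1 to 8 are made up. The example assumes the sequence has at least L + 1 items.

```python
def sliding_windows(items, window_size):
    # Walk backwards from the most recent item, as _win does
    if len(items) < window_size:
        return [items]
    return [items[end - window_size:end]
            for end in range(len(items), window_size - 1, -1)]

L = 5
train_items = list(range(1, 9))   # eight hypothetical movie ids, oldest first
# Each window holds L input items followed by one target item
samples = [(w[:L], w[L]) for w in sliding_windows(train_items, L + 1)]
for seq, target in samples:
    print(seq, '->', target)
```

This yields the three training samples mentioned in the text: items 3 to 7 predicting 8, items 2 to 6 predicting 7, and items 1 to 5 predicting 6.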


16.7.4 Load the MovieLens 100K dataset 


Next, we read and split the MovieLens 100K dataset in sequence-aware mode and load the
training data with the sequential dataloader implemented above.


TARGET_NUM, L, batch_size = 1, 5, 4096
df, num_users, num_items = d2l.read_data_ml100k()
train_data, test_data = d2l.split_data_ml100k(df, num_users, num_items,
                                              'seq-aware')
users_train, items_train, ratings_train, candidates = d2l.load_data_ml100k(
    train_data, num_users, num_items, feedback="implicit")
users_test, items_test, ratings_test, test_iter = d2l.load_data_ml100k(
    test_data, num_users, num_items, feedback="implicit")
train_seq_data = SeqDataset(users_train, items_train, L, num_users,
                            num_items, candidates)
train_iter = gluon.data.DataLoader(train_seq_data, batch_size, True,
                                   last_batch="rollover",
                                   num_workers=d2l.get_dataloader_workers())
test_seq_iter = train_seq_data.test_seq
train_seq_data[0]


(array(0, dtype=int32),
 array([241., 170., 110., 255.,   4.]),
 array([101.]),
 977)


The training data structure is shown above. The first element is the user identity, the second
is the list of the five most recent items this user liked, the third is the item this user liked
after those five items, and the last is a randomly sampled negative item.


16.7.5 Train the Model 


Now, let us train the model. We use the same optimizer, weight decay, and k as for NeuMF in the
last section, with a different learning rate and number of epochs, so that the results are
broadly comparable.


devices = d2l.try_all_gpus()
net = Caser(10, num_users, num_items, L)
net.initialize(ctx=devices, force_reinit=True, init=mx.init.Normal(0.01))
lr, num_epochs, wd, optimizer = 0.04, 8, 1e-5, 'adam'
loss = d2l.BPRLoss()
trainer = gluon.Trainer(net.collect_params(), optimizer,
                        {"learning_rate": lr, 'wd': wd})

d2l.train_ranking(net, train_iter, test_iter, loss, trainer, test_seq_iter,
                  num_users, num_items, num_epochs, devices,
                  d2l.evaluate_ranking, candidates, eval_step=1)


train loss 0.763, test hit rate 0.394, test AUC 0.756 
29.1 examples/sec on [gpu(0), gpu(1)]


[Figure: test hit rate and test AUC versus epoch]





Summary 
• Inferring a user's short-term and long-term interests can make the prediction of the next item
that the user prefers more effective.


• Convolutional neural networks can be utilized to capture users' short-term interests from
sequential interactions.


Exercises


• Conduct an ablation study by removing one of the horizontal and vertical convolutional
networks. Which component is more important?


• Vary the hyperparameter L. Do longer historical interactions bring higher accuracy?


• Apart from the sequence-aware recommendation task we introduced above, there is another
type of sequence-aware recommendation task called session-based recommendation (Hidasi et al.,
2015). Can you explain the differences between these two tasks?


Discussions


16.8 Feature-Rich Recommender Systems 


Interaction data is the most basic indication of users' preferences and interests. It plays a
critical role in the previously introduced models. Yet, interaction data is usually extremely
sparse and can be noisy at times. To address this issue, we can integrate side information such
as features of items, profiles of users, and even the context in which the interaction occurred
into the recommendation model. Utilizing these features is helpful in making recommendations
because they can be an effective predictor of users' interests, especially when interaction data
is lacking. As such, it is essential for recommendation models to be able to deal with those
features, which gives the model some content/context awareness. To demonstrate this type of
recommendation model, we introduce another task, click-through rate (CTR) prediction for online
advertisement recommendations (McMahan et al., 2013), and present an anonymized advertising
dataset. Targeted advertisement services have attracted widespread attention and are often framed
as recommendation engines. Recommending advertisements that match users' personal tastes and
interests is important for improving the click-through rate.


Digital marketers use online advertising to display advertisements to customers. Click-through
rate is a metric that measures the number of clicks advertisers receive on their ads per number of
impressions, and it is expressed as a percentage calculated with the formula:


$$\text{CTR} = \frac{\#\text{Clicks}}{\#\text{Impressions}} \times 100\%. \tag{16.8.1}$$


Click-through rate is an important signal that indicates the effectiveness of prediction
algorithms. Click-through rate prediction is the task of predicting the likelihood that something
on a website will be clicked. Models for CTR prediction can be employed not only in targeted
advertising systems but also in general item (e.g., movies, news, products) recommender systems,
email campaigns, and even search engines. It is also closely related to user satisfaction and
conversion rate, and it can help advertisers set realistic expectations for campaign goals.
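
As a quick arithmetic check of the CTR formula, consider the sketch below. The function name `click_through_rate` and the click and impression counts are hypothetical.

```python
def click_through_rate(clicks, impressions):
    # CTR as a percentage; it is undefined when nothing was shown
    if impressions <= 0:
        raise ValueError('impressions must be positive')
    return 100.0 * clicks / impressions

ctr = click_through_rate(clicks=25, impressions=1000)
print(f'{ctr:.1f}%')  # 2.5%
```

An ad shown 1000 times and clicked 25 times has a CTR of 2.5%.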


from collections import defaultdict
from d2l import mxnet as d2l
from mxnet import gluon, np
import os





Discussions (Section 16.7): https://discuss.d2l.ai/t/404

16.8.1 An Online Advertising Dataset


With the considerable advancements of Internet and mobile technology, online advertising has
become an important source of income and generates the vast majority of revenue in the Internet
industry. It is important to display relevant advertisements or advertisements that pique users'
interests so that casual visitors can be converted into paying customers. The dataset we
introduce here is an online advertising dataset. It consists of 34 fields, with the first column
representing the target variable that indicates if an ad was clicked (1) or not (0). All the other
columns are categorical features. The columns might represent the advertisement id, site or
application id, device id, time, user profiles, and so on. The real semantics of the features are
undisclosed due to anonymization and privacy concerns.


The following code downloads the dataset from our server and saves it into the local data folder.


#@save
d2l.DATA_HUB['ctr'] = (d2l.DATA_URL + 'ctr.zip',
                       'e18327c48c8e8e5c23da714dd614e390d369843f')

data_dir = d2l.download_extract('ctr')


Downloading ../data/ctr.zip from http://d2l-data.s3-accelerate.amazonaws.com/ctr.zip...


There are a training set and a test set, consisting of 15000 and 3000 samples/lines, respectively. 


16.8.2 Dataset Wrapper 


For the convenience of data loading, we implement a CTRDataset which loads the advertising 
dataset from the CSV file and can be used by DataLoader. 


#@save
class CTRDataset(gluon.data.Dataset):
    def __init__(self, data_path, feat_mapper=None, defaults=None,
                 min_threshold=4, num_feat=34):
        self.NUM_FEATS, self.count, self.data = num_feat, 0, {}
        feat_cnts = defaultdict(lambda: defaultdict(int))
        self.feat_mapper, self.defaults = feat_mapper, defaults
        self.field_dims = np.zeros(self.NUM_FEATS, dtype=np.int64)
        with open(data_path) as f:
            for line in f:
                instance = {}
                values = line.rstrip('\n').split('\t')
                if len(values) != self.NUM_FEATS + 1:
                    continue
                label = np.float32([0, 0])
                label[int(values[0])] = 1
                instance['y'] = [np.float32(values[0])]
                for i in range(1, self.NUM_FEATS + 1):
                    feat_cnts[i][values[i]] += 1
                    instance.setdefault('x', []).append(values[i])
                self.data[self.count] = instance
                self.count = self.count + 1
        if self.feat_mapper is None and self.defaults is None:
            # Keep only the values that occur at least min_threshold times
            feat_mapper = {i: {feat for feat, c in cnt.items() if c >=
                               min_threshold} for i, cnt in feat_cnts.items()}
            self.feat_mapper = {i: {feat: idx for idx, feat in enumerate(cnt)}
                                for i, cnt in feat_mapper.items()}
            self.defaults = {i: len(cnt) for i, cnt in feat_mapper.items()}
        for i, fm in self.feat_mapper.items():
            self.field_dims[i - 1] = len(fm) + 1
        self.offsets = np.array((0, *np.cumsum(self.field_dims).asnumpy()
                                 [:-1]))

    def __len__(self):
        return self.count

    def __getitem__(self, idx):
        feat = np.array([self.feat_mapper[i + 1].get(v, self.defaults[i + 1])
                         for i, v in enumerate(self.data[idx]['x'])])
        return feat + self.offsets, self.data[idx]['y']


The following example loads the training data and prints out the first record. 


train_data = CTRDataset(os.path.join(data_dir, 'train.csv')) 
train_data[0] 


(array([ 143.,  145.,  227.,  238.,  957., 1250., 1471., 1566., 1624.,
        1927., 2008., 2061., 2261., 2304., 2305., 2360., 2745., 2746.,
        2747., 2748., 2892., 2988., 3165., 3171., 3194., 3195., 3206.,
        3655., 3687., 3696., 3725., 3742., 3775., 3796.]),
 [1.0])


As can be seen, all 34 fields are categorical features. Each value represents the one-hot index 
of the corresponding entry. A label of 0 means that the ad was not clicked, and a label of 1 means 
that it was clicked. This CTRDataset can also be used to load other datasets such as the Criteo 
display advertising challenge dataset [232] and the Avazu click-through rate prediction dataset [233]. 
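To make the index mapping concrete, the following is a minimal standard-library sketch of the same idea on a made-up dataset with two categorical fields. The records, the threshold of 2, and the helper name `encode` are illustrative assumptions rather than part of the dataset or of CTRDataset. Unlike the class above, the kept values are sorted so that the toy indices are deterministic.

```python
from collections import defaultdict

# Hypothetical toy records: one label followed by two categorical fields.
rows = [
    ['1', 'siteA', 'phone'],
    ['0', 'siteA', 'tablet'],
    ['0', 'siteB', 'phone'],
    ['1', 'siteA', 'phone'],
]
num_fields, min_threshold = 2, 2

# Count how often each raw value occurs in each field.
counts = defaultdict(lambda: defaultdict(int))
for row in rows:
    for i in range(1, num_fields + 1):
        counts[i][row[i]] += 1

# Keep only frequent values; sorting is an assumption made for determinism.
feat_mapper = {
    i: {v: idx for idx, v in enumerate(sorted(
        v for v, c in cnt.items() if c >= min_threshold))}
    for i, cnt in counts.items()}
# Rare or unseen values share one default index per field.
defaults = {i: len(m) for i, m in feat_mapper.items()}
field_dims = [len(feat_mapper[i]) + 1 for i in range(1, num_fields + 1)]
# offsets[i] is the number of indices consumed by all earlier fields.
offsets = [sum(field_dims[:i]) for i in range(num_fields)]


def encode(row):
    """Map one raw record to global (offset) feature indices."""
    return [feat_mapper[i + 1].get(v, defaults[i + 1]) + offsets[i]
            for i, v in enumerate(row[1:])]


encoded = [encode(r) for r in rows]
print(field_dims, offsets, encoded)
```

Here the rare values `siteB` and `tablet` fall back to the default index of their field, and the offsets shift the second field so that indices from different fields never collide in a shared embedding table.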


Summary 
• Click-through rate is an important metric that is used to measure the effectiveness of 
advertising systems and recommender systems. 

• Click-through rate prediction is usually converted to a binary classification problem. The 
target is to predict whether an ad/item will be clicked or not based on given features. 





232 https://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/ 
233 https://www.kaggle.com/c/avazu-ctr-prediction 

Exercises 


• Can you load the Criteo and Avazu datasets with the provided CTRDataset? It is worth noting 
that the Criteo dataset contains real-valued features, so you may have to revise the code 
a bit. 


Discussions [234] 


16.9 Factorization Machines 


Factorization machines (FM) (Rendle, 2010), proposed by Steffen Rendle in 2010, are a supervised 
algorithm that can be used for classification, regression, and ranking tasks. The method quickly 
gained attention and became a popular and impactful way to make predictions and recommendations. 
In particular, it is a generalization of the linear regression model and the matrix factorization model. 
Moreover, it is reminiscent of support vector machines with a polynomial kernel. The strengths of 
factorization machines over linear regression and matrix factorization are: (1) they can model 
$\chi$-way variable interactions, where $\chi$ is the polynomial order and is usually set to two; (2) 
a fast optimization algorithm associated with factorization machines can reduce the polynomial 
computation time to linear complexity, making them extremely efficient, especially for 
high-dimensional sparse inputs. For these reasons, factorization machines are widely employed in 
modern advertisement and product recommendation. The technical details and implementations are 
described below. 


16.9.1 2-Way Factorization Machines 


Formally, let $x \in \mathbb{R}^d$ denote the feature vector of one sample, and $y$ denote the 
corresponding label, which can be a real-valued label or a class label such as the binary class 
“click/non-click”. The model for a factorization machine of degree two is defined as: 


$$\hat{y}(x) = \mathbf{w}_0 + \sum_{i=1}^d \mathbf{w}_i x_i + \sum_{i=1}^d\sum_{j=i+1}^d \langle\mathbf{v}_i, \mathbf{v}_j\rangle x_i x_j \qquad (16.9.1)$$


where $\mathbf{w}_0 \in \mathbb{R}$ is the global bias; $\mathbf{w} \in \mathbb{R}^d$ denotes the 
weights, with $\mathbf{w}_i$ the weight of the $i^\mathrm{th}$ variable; $\mathbf{V} \in \mathbb{R}^{d\times k}$ 
represents the feature embeddings; $\mathbf{v}_i$ represents the $i^\mathrm{th}$ row of $\mathbf{V}$; 
$k$ is the dimensionality of latent factors; $\langle\cdot, \cdot \rangle$ is the dot product of two 
vectors. $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ models the interaction between the $i^\mathrm{th}$ 
and $j^\mathrm{th}$ features. Some feature interactions can be easily understood, so they can be 
designed by experts. However, most other feature interactions are hidden in data and difficult to 
identify. So modeling feature interactions automatically can greatly reduce the effort in feature 
engineering. The first two terms correspond to the linear regression model and the last term is 
an extension of the matrix factorization model. If feature $i$ represents an item and feature 
$j$ represents a user, the third term is exactly the dot product between user and item embeddings. 
It is worth noting that FM can also generalize to higher orders (degree > 2). Nevertheless, 
numerical instability at higher orders might weaken generalization. 
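The connection to matrix factorization can be checked with a small sketch that evaluates the degree-two model by looping over all feature pairs. The function name `fm_naive`, the toy parameters, and the layout of the input (two one-hot user features followed by two one-hot item features) are illustrative assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def fm_naive(x, w0, w, V):
    """Evaluate the degree-two FM by looping over all feature pairs."""
    d = len(x)
    linear = w0 + sum(w[i] * x[i] for i in range(d))
    pairwise = sum(dot(V[i], V[j]) * x[i] * x[j]
                   for i in range(d) for j in range(i + 1, d))
    return linear + pairwise


# Hypothetical toy parameters: features 0-1 are users, features 2-3 are items.
V = [[1.0, 0.0], [0.5, 2.0], [3.0, 1.0], [-1.0, 4.0]]
w0, w = 0.0, [0.0, 0.0, 0.0, 0.0]
x = [0.0, 1.0, 1.0, 0.0]  # one-hot user 1 and one-hot item 2

score = fm_naive(x, w0, w, V)
print(score, dot(V[1], V[2]))
```

With zero bias and zero linear weights, the only surviving pair is (user 1, item 2), so the prediction reduces to the dot product of the two embeddings, as in matrix factorization.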





234 https://discuss.d2l.ai/t/405 

16.9.2 An Efficient Optimization Criterion 


Optimizing factorization machines in a straightforward way leads to a complexity of 
$\mathcal{O}(kd^2)$, as all pairwise interactions need to be computed. To solve this inefficiency 
problem, we can reorganize the third term of FM, which greatly reduces the computation cost and 
leads to a linear time complexity ($\mathcal{O}(kd)$). The reformulation of the pairwise interaction 
term is as follows: 


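$$\begin{aligned}
&\sum_{i=1}^d \sum_{j=i+1}^d \langle\mathbf{v}_i, \mathbf{v}_j\rangle x_i x_j \\
&= \frac{1}{2} \sum_{i=1}^d \sum_{j=1}^d\langle\mathbf{v}_i, \mathbf{v}_j\rangle x_i x_j - \frac{1}{2}\sum_{i=1}^d \langle\mathbf{v}_i, \mathbf{v}_i\rangle x_i x_i \\
&= \frac{1}{2} \left(\sum_{i=1}^d \sum_{j=1}^d \sum_{l=1}^k\mathbf{v}_{i, l} \mathbf{v}_{j, l} x_i x_j - \sum_{i=1}^d \sum_{l=1}^k \mathbf{v}_{i, l} \mathbf{v}_{i, l} x_i x_i \right)\\
&= \frac{1}{2} \sum_{l=1}^k \left(\left(\sum_{i=1}^d \mathbf{v}_{i, l} x_i\right) \left(\sum_{j=1}^d \mathbf{v}_{j, l}x_j\right) - \sum_{i=1}^d \mathbf{v}_{i, l}^2 x_i^2 \right) \\
&= \frac{1}{2} \sum_{l=1}^k \left(\left(\sum_{i=1}^d \mathbf{v}_{i, l} x_i\right)^2 - \sum_{i=1}^d \mathbf{v}_{i, l}^2 x_i^2\right)
\end{aligned} \qquad (16.9.2)$$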

With this reformulation, the model complexity is decreased greatly. Moreover, for sparse features, 
only non-zero elements need to be computed, so the overall complexity is linear in the 
number of non-zero features. 
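The identity can be verified numerically. The sketch below compares the quadratic-time double loop with the linear-time form on random toy values; the function names `pairwise_naive` and `pairwise_fast` and the sizes `d = 8`, `k = 3` are illustrative assumptions. The fast version also skips zero entries of the input, which is where the benefit for sparse features comes from.

```python
import random


def pairwise_naive(x, V):
    """O(k d^2): sum over all pairs i < j of <v_i, v_j> x_i x_j."""
    d, k = len(V), len(V[0])
    return sum(sum(V[i][l] * V[j][l] for l in range(k)) * x[i] * x[j]
               for i in range(d) for j in range(i + 1, d))


def pairwise_fast(x, V):
    """O(k * nnz): half of (square of sum minus sum of squares) per factor."""
    k = len(V[0])
    nonzero = [i for i, xi in enumerate(x) if xi != 0]
    total = 0.0
    for l in range(k):
        s = sum(V[i][l] * x[i] for i in nonzero)
        sq = sum((V[i][l] * x[i]) ** 2 for i in nonzero)
        total += s * s - sq
    return 0.5 * total


random.seed(0)
d, k = 8, 3
V = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(d)]
x = [random.choice([0.0, 0.0, 1.0, 2.0]) for _ in range(d)]
naive, fast = pairwise_naive(x, V), pairwise_fast(x, V)
print(abs(naive - fast) < 1e-9)
```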




To learn the FM model, we can use the MSE loss for regression tasks, the cross-entropy loss for 
classification tasks, and the BPR loss for ranking tasks. Standard optimizers such as SGD and Adam 
are viable for optimization. 


from d2l import mxnet as d2l
from mxnet import init, gluon, np, npx
from mxnet.gluon import nn
import os

npx.set_np()


16.9.3 Model Implementation 


The following code implements the factorization machine. FM consists of a linear regression 
block and an efficient feature interaction block. We apply a sigmoid function over the final 
score since we treat CTR prediction as a classification task. 


class FM(nn.Block):
    def __init__(self, field_dims, num_factors):
        super(FM, self).__init__()
        num_inputs = int(sum(field_dims))
        self.embedding = nn.Embedding(num_inputs, num_factors)
        self.fc = nn.Embedding(num_inputs, 1)
        self.linear_layer = nn.Dense(1, use_bias=True)

    def forward(self, x):
        square_of_sum = np.sum(self.embedding(x), axis=1) ** 2
        sum_of_square = np.sum(self.embedding(x) ** 2, axis=1)
        x = self.linear_layer(self.fc(x).sum(1)) \
            + 0.5 * (square_of_sum - sum_of_square).sum(1, keepdims=True)
        x = npx.sigmoid(x)
        return x


16.9.4 Load the Advertising Dataset 


We use the CTR data wrapper from the last section to load the online advertising dataset. 


batch_size = 2048
data_dir = d2l.download_extract('ctr')
train_data = d2l.CTRDataset(os.path.join(data_dir, 'train.csv'))
test_data = d2l.CTRDataset(os.path.join(data_dir, 'test.csv'),
                           feat_mapper=train_data.feat_mapper,
                           defaults=train_data.defaults)
train_iter = gluon.data.DataLoader(
    train_data, shuffle=True, last_batch='rollover', batch_size=batch_size,
    num_workers=d2l.get_dataloader_workers())
test_iter = gluon.data.DataLoader(
    test_data, shuffle=False, last_batch='rollover', batch_size=batch_size,
    num_workers=d2l.get_dataloader_workers())


16.9.5 Train the Model 


Afterwards, we train the model. The learning rate is set to 0.02 and the embedding size is set to 20 
by default. The Adam optimizer and the SigmoidBinaryCrossEntropyLoss loss are used for model 
training. 


devices = d2l.try_all_gpus()
net = FM(train_data.field_dims, num_factors=20)
net.initialize(init.Xavier(), ctx=devices)
lr, num_epochs, optimizer = 0.02, 30, 'adam'
trainer = gluon.Trainer(net.collect_params(), optimizer,
                        {'learning_rate': lr})
loss = gluon.loss.SigmoidBinaryCrossEntropyLoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)


loss 0.505, train acc 0.256, test acc 0.248 
174097.5 examples/sec on [gpu(0), gpu(1)] 





[Figure: train loss, train accuracy, and test accuracy plotted against epoch]


Summary 


• FM is a general framework that can be applied to a variety of tasks such as regression, 
classification, and ranking. 

• Feature interaction/crossing is important for prediction tasks and the 2-way interaction can 
be efficiently modeled with FM. 


Exercises 


• Can you test FM on other datasets such as Avazu, MovieLens, and Criteo? 

• Vary the embedding size to check its impact on performance. Can you observe a similar 
pattern as that of matrix factorization? 


Discussions [235] 


16.10 Deep Factorization Machines 


Learning effective feature combinations is critical to the success of click-through rate prediction 
tasks. Factorization machines model feature interactions in a linear paradigm (e.g., bilinear 
interactions). This is often insufficient for real-world data, where inherent feature crossing 
structures are usually very complex and nonlinear. What's worse, only second-order feature 
interactions are generally used in factorization machines in practice. Modeling higher degrees of 
feature combinations with factorization machines is possible theoretically, but it is usually not 
adopted due to numerical instability and high computational complexity. 


One effective solution is using deep neural networks. Deep neural networks are powerful in 
feature representation learning and have the potential to learn sophisticated feature interactions. 
As such, it is natural to integrate deep neural networks with factorization machines. Adding 
nonlinear transformation layers to factorization machines gives them the capability to model both 
low-order and high-order feature combinations. Moreover, inherent nonlinear structures in the 
inputs can also be captured with deep neural networks. In this section, we will 





235 https://discuss.d2l.ai/t/406 

introduce a representative model named deep factorization machines (DeepFM) (Guo et al., 2017), 
which combines FM and deep neural networks. 


16.10.1 Model Architectures 


DeepFM consists of an FM component and a deep component, which are integrated in a parallel 
structure. The FM component is the same as the 2-way factorization machine and is used 
to model the low-order feature interactions. The deep component is a multilayer perceptron 
that is used to capture high-order feature interactions and nonlinearities. These two components 
share the same inputs/embeddings and their outputs are summed up as the final prediction. It 
is worth pointing out that the spirit of DeepFM resembles that of the Wide & Deep architecture, 
which can capture both memorization and generalization. The advantage of DeepFM over the 
Wide & Deep model is that it reduces the effort of hand-crafted feature engineering by identifying 
feature combinations automatically. 


We omit the description of the FM component for brevity and denote its output as $\hat{y}^{(FM)}$. 
Readers are referred to the last section for more details. Let $\mathbf{e}_i \in \mathbb{R}^{k}$ denote 
the latent feature vector of the $i^\mathrm{th}$ field. The input of the deep component is the 
concatenation of the dense embeddings of all fields that are looked up with the sparse categorical 
feature input, denoted as: 

$$\mathbf{z}^{(0)} = [\mathbf{e}_1, \mathbf{e}_2, ..., \mathbf{e}_f], \qquad (16.10.1)$$

where $f$ is the number of fields. It is then fed into the following neural network: 

$$\mathbf{z}^{(l)} = \alpha(\mathbf{W}^{(l)}\mathbf{z}^{(l-1)} + \mathbf{b}^{(l)}), \qquad (16.10.2)$$

where $\alpha$ is the activation function. $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are the weight and 
bias at the $l^\mathrm{th}$ layer. Let $\hat{y}^{(DNN)}$ denote the output of the deep component. The 
ultimate prediction of DeepFM is the summation of the outputs from both FM and DNN. So we have: 

$$\hat{y} = \sigma(\hat{y}^{(FM)} + \hat{y}^{(DNN)}), \qquad (16.10.3)$$

where $\sigma$ is the sigmoid function. The architecture of DeepFM is illustrated below. 

It is worth noting that DeepFM is not the only way to combine deep neural networks with FM. We 
can also add nonlinear layers over the feature interactions (He & Chua, 2017). 
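Before turning to the Gluon implementation, equations (16.10.1) to (16.10.3) can be traced by hand with a small standard-library sketch. The helper names (`dense`, `relu`, `deepfm_forward`) and all numeric values are illustrative assumptions; the FM output is passed in as a plain number so that only the deep component and the final combination are shown.

```python
import math


def relu(v):
    return [max(0.0, a) for a in v]


def dense(W, b, v):
    """Affine map W v + b, with W given as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi
            for row, bi in zip(W, b)]


def deepfm_forward(E, y_fm, layers, out_W, out_b):
    # z^(0): concatenate the f field embeddings into one vector.
    z = [a for e in E for a in e]
    # z^(l) = alpha(W^(l) z^(l-1) + b^(l)), with alpha chosen as ReLU here.
    for W, b in layers:
        z = relu(dense(W, b, z))
    y_dnn = dense(out_W, out_b, z)[0]
    # Sum the FM and deep outputs, then squash with the sigmoid.
    return 1.0 / (1.0 + math.exp(-(y_fm + y_dnn)))


# Hypothetical toy values: f = 2 fields, k = 2 latent factors, one hidden layer.
E = [[1.0, -1.0], [0.5, 2.0]]
layers = [([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0])]
out_W, out_b = [[1.0, -1.0]], [0.0]
y_hat = deepfm_forward(E, y_fm=0.0, layers=layers, out_W=out_W, out_b=out_b)
print(y_hat)
```

In this toy setting the hidden layer produces two equal activations that cancel in the output layer, so the combined score is zero and the predicted click probability is 0.5.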


from d2l import mxnet as d2l
from mxnet import init, gluon, np, npx
from mxnet.gluon import nn
import os

npx.set_np()


16.10.2 Implementation of DeepFM 


The implementation of DeepFM is similar to that of FM. We keep the FM part unchanged and use 
an MLP block with relu as the activation function. Dropout is also used to regularize the model. 
The number of neurons of the MLP can be adjusted with the mlp_dims hyperparameter. 


class DeepFM(nn.Block):
    def __init__(self, field_dims, num_factors, mlp_dims, drop_rate=0.1):
        super(DeepFM, self).__init__()
        num_inputs = int(sum(field_dims))
        self.embedding = nn.Embedding(num_inputs, num_factors)
        self.fc = nn.Embedding(num_inputs, 1)
        self.linear_layer = nn.Dense(1, use_bias=True)
        input_dim = self.embed_output_dim = len(field_dims) * num_factors
        self.mlp = nn.Sequential()
        for dim in mlp_dims:
            self.mlp.add(nn.Dense(dim, 'relu', True, in_units=input_dim))
            self.mlp.add(nn.Dropout(rate=drop_rate))
            input_dim = dim
        self.mlp.add(nn.Dense(in_units=input_dim, units=1))

    def forward(self, x):
        embed_x = self.embedding(x)
        square_of_sum = np.sum(embed_x, axis=1) ** 2
        sum_of_square = np.sum(embed_x ** 2, axis=1)
        inputs = np.reshape(embed_x, (-1, self.embed_output_dim))
        x = self.linear_layer(self.fc(x).sum(1)) \
            + 0.5 * (square_of_sum - sum_of_square).sum(1, keepdims=True) \
            + self.mlp(inputs)
        x = npx.sigmoid(x)
        return x


16.10.3 Training and Evaluating the Model 


The data loading process is the same as that of FM. We set the MLP component of DeepFM to a 
three-layer dense network with a pyramid structure (30-20-10). All other hyperparameters 
remain the same as FM. 


batch_size = 2048
data_dir = d2l.download_extract('ctr')
train_data = d2l.CTRDataset(os.path.join(data_dir, 'train.csv'))
test_data = d2l.CTRDataset(os.path.join(data_dir, 'test.csv'),
                           feat_mapper=train_data.feat_mapper,
                           defaults=train_data.defaults)
field_dims = train_data.field_dims
train_iter = gluon.data.DataLoader(
    train_data, shuffle=True, last_batch='rollover', batch_size=batch_size,
    num_workers=d2l.get_dataloader_workers())
test_iter = gluon.data.DataLoader(
    test_data, shuffle=False, last_batch='rollover', batch_size=batch_size,
    num_workers=d2l.get_dataloader_workers())
devices = d2l.try_all_gpus()
net = DeepFM(field_dims, num_factors=10, mlp_dims=[30, 20, 10])
net.initialize(init.Xavier(), ctx=devices)
lr, num_epochs, optimizer = 0.01, 30, 'adam'
trainer = gluon.Trainer(net.collect_params(), optimizer,
                        {'learning_rate': lr})
loss = gluon.loss.SigmoidBinaryCrossEntropyLoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)


loss 0.510, train acc 0.471, test acc 0.470 
94320.1 examples/sec on [gpu(0), gpu(1)] 

Compared with FM, DeepFM converges faster and achieves better performance. 


Summary 


• Integrating neural networks into FM enables it to model complex and high-order interactions. 


• DeepFM outperforms the original FM on the advertising dataset. 


Exercises 


• Vary the structure of the MLP to check its impact on model performance. 
• Change the dataset to Criteo and compare it with the original FM model. 


Discussions [236] 





236 https://discuss.d2l.ai/t/407 

17 Generative Adversarial Networks 


17.1 Generative Adversarial Networks 


Throughout most of this book, we have talked about how to make predictions. In some form or 
another, we used deep neural networks to learn mappings from data examples to labels. This kind 
of learning is called discriminative learning, as in, we'd like to be able to discriminate between 
photos of cats and photos of dogs. Classifiers and regressors are both examples of discriminative 
learning. And neural networks trained by backpropagation have upended everything we thought 
we knew about discriminative learning on large complicated datasets. Classification accuracies 
on high-res images have gone from useless to human-level (with some caveats) in just 5-6 years. We 
will spare you another spiel about all the other discriminative tasks where deep neural networks 
do astoundingly well. 


But there is more to machine learning than just solving discriminative tasks. For example, given 
a large dataset, without any labels, we might want to learn a model that concisely captures the 
characteristics of this data. Given such a model, we could sample synthetic data examples that 
resemble the distribution of the training data. For example, given a large corpus of photographs 
of faces, we might want to be able to generate a new photorealistic image that looks like it might 
plausibly have come from the same dataset. This kind of learning is called generative modeling. 


Until recently, we had no method that could synthesize novel photorealistic images. But the suc- 
cess of deep neural networks for discriminative learning opened up new possibilities. One big 
trend over the last three years has been the application of discriminative deep nets to overcome 
challenges in problems that we do not generally think of as supervised learning problems. The 
recurrent neural network language models are one example of using a discriminative network 
(trained to predict the next character) that once trained can act as a generative model. 


In 2014, a breakthrough paper introduced generative adversarial networks (GANs) (Goodfellow 
et al., 2014), a clever new way to leverage the power of discriminative models to get good 
generative models. At their heart, GANs rely on the idea that a data generator is good if we cannot 
tell fake data apart from real data. In statistics, this is called a two-sample test: a test to answer 
the question whether datasets $X=\{x_1, \ldots, x_n\}$ and $X'=\{x'_1, \ldots, x'_n\}$ were drawn from 
the same distribution. The main difference between most statistics papers and GANs is that the 
latter use this idea in a constructive way. In other words, rather than just training a model to say 
“hey, these two datasets do not look like they came from the same distribution”, they use the 
two-sample test [237] to provide training signals to a generative model. This allows us to improve 
the data generator until it generates something that resembles the real data. At the very least, it 
needs to fool the classifier, even if our classifier is a state-of-the-art deep neural network. 





237 https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing 

Fig. 17.1.1: Generative Adversarial Networks 


The GAN architecture is illustrated in Fig. 17.1.1. As you can see, there are two pieces in the GAN 
architecture. First, we need a device (say, a deep network, but it really could be anything, such 
as a game rendering engine) that might potentially be able to generate data that looks just like the 
real thing. If we are dealing with images, this needs to generate images. If we are dealing with 
speech, it needs to generate audio sequences, and so on. We call this the generator network. The 
second component is the discriminator network. It attempts to distinguish fake and real data from 
each other. Both networks are in competition with each other. The generator network attempts 
to fool the discriminator network. At that point, the discriminator network adapts to the new fake 
data. This information, in turn, is used to improve the generator network, and so on. 


The discriminator is a binary classifier that distinguishes whether the input $\mathbf x$ is real (from 
real data) or fake (from the generator). Typically, the discriminator outputs a scalar prediction 
$o\in\mathbb R$ for input $\mathbf x$, for example using a dense layer with hidden size 1, and then 
applies the sigmoid function to obtain the predicted probability $D(\mathbf x) = 1/(1+e^{-o})$. Assume 
the label $y$ for the true data is $1$ and $0$ for the fake data. We train the discriminator to minimize 
the cross-entropy loss, i.e., 


$$\min_D \{ - y \log D(\mathbf x) - (1-y)\log(1-D(\mathbf x)) \}, \qquad (17.1.1)$$


The generator first draws some parameter $\mathbf z\in\mathbb R^d$ from a source of randomness, 
e.g., a normal distribution $\mathbf z \sim \mathcal{N}(0, 1)$. We often call $\mathbf z$ the latent 
variable. It then applies a function to generate $\mathbf x'=G(\mathbf z)$. The goal of the generator is 
to fool the discriminator into classifying $\mathbf x'=G(\mathbf z)$ as true data, i.e., we want 
$D(G(\mathbf z)) \approx 1$. In other words, for a given discriminator $D$, we update the parameters 
of the generator $G$ to maximize the cross-entropy loss when $y=0$, i.e., 


$$\max_G \{ - (1-y) \log(1-D(G(\mathbf z))) \} = \max_G \{ - \log(1-D(G(\mathbf z))) \}. \qquad (17.1.2)$$


If the discriminator does a perfect job, then $D(\mathbf x')\approx 0$, so the above loss is near 0, 
which results in gradients that are too small for the generator to make good progress. So 
commonly we minimize the following loss: 


$$\min_G \{ - y \log(D(G(\mathbf z))) \} = \min_G \{ - \log(D(G(\mathbf z))) \}, \qquad (17.1.3)$$


which simply feeds $\mathbf x'=G(\mathbf z)$ into the discriminator but gives it the label $y=1$. 
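The difference between the two generator objectives is easiest to see through their gradients with respect to the discriminator's logit $o$ on a generated sample. The following sketch uses the closed-form derivatives of $-\log(1-\mathrm{sigmoid}(o))$ and $-\log(\mathrm{sigmoid}(o))$; the function names and the choice $o=-5$ (a discriminator that confidently rejects the sample) are illustrative assumptions.

```python
import math


def sigmoid(o):
    return 1.0 / (1.0 + math.exp(-o))


def grad_saturating(o):
    """d/do of -log(1 - sigmoid(o)), the objective in (17.1.2)."""
    return sigmoid(o)


def grad_non_saturating(o):
    """d/do of -log(sigmoid(o)), the loss in (17.1.3)."""
    return sigmoid(o) - 1.0


o = -5.0  # the discriminator is confident that the sample is fake
g_sat, g_ns = grad_saturating(o), grad_non_saturating(o)
print(f'saturating: {g_sat:.4f}, non-saturating: {g_ns:.4f}')
```

When the discriminator is confident, the first gradient is close to zero while the second has magnitude close to one, which is why the relabeled loss gives the generator a more useful training signal.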


To sum up, $D$ and $G$ are playing a “minimax” game with the comprehensive objective function: 


$$\min_D \max_G \{ -E_{x \sim \textrm{Data}} \log D(\mathbf x) - E_{z \sim \textrm{Noise}} \log(1 - D(G(\mathbf z))) \}. \qquad (17.1.4)$$

Many GAN applications are in the context of images. For demonstration purposes, we 
are going to content ourselves with fitting a much simpler distribution first. We will illustrate 
what happens if we use GANs to build the world's most inefficient estimator of parameters for a 
Gaussian. Let us get started. 


%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn

npx.set_np()


17.1.1 Generate some “real” data 


Since this is going to be the world’s lamest example, we simply generate data drawn from a Gaus- 
sian. 


X = np.random.normal(0.0, 1, (1000, 2))
A = np.array([[1, 2], [-0.1, 0.5]])
b = np.array([1, 2])
data = np.dot(X, A) + b


Let us see what we got. This should be a Gaussian shifted in some rather arbitrary way with mean 
$b$ and covariance matrix $A^TA$. 


d2l.set_figsize() 
d2l.plt.scatter(data[:100, (0)].asnumpy(), data[:100, (1)].asnumpy()) 
print(f'The covariance matrix is\n{np.dot(A.T, A)}')


The covariance matrix is 
[[1.01 1.95] 
 [1.95 4.25]] 
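The covariance claim follows from the fact that each sample is a row vector $x$ with identity covariance mapped to $xA + b$, so its covariance is $A^T I A = A^T A$. As a quick sanity check that does not need MXNet, the product can be computed with plain lists; the helper names `matmul` and `transpose` are illustrative.

```python
A = [[1.0, 2.0], [-0.1, 0.5]]


def transpose(P):
    return [list(col) for col in zip(*P)]


def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


cov = matmul(transpose(A), A)
print([[round(c, 2) for c in row] for row in cov])
```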





batch_size = 8
data_iter = d2l.load_array((data,), batch_size)

17.1.2 Generator 


Our generator network will be the simplest network possible: a single-layer linear model. This 
is because we will be driving that linear network with a Gaussian data generator. Hence, it literally 
only needs to learn the parameters to fake things perfectly. 


net_G = nn.Sequential() 
net_G.add(nn.Dense(2)) 


17.1.3 Discriminator 


For the discriminator we will be a bit more discriminating: we will use an MLP with 3 layers to 
make things a bit more interesting. 


net_D = nn.Sequential()
net_D.add(nn.Dense(5, activation='tanh'),
          nn.Dense(3, activation='tanh'),
          nn.Dense(1))


17.1.4 Training 


First we define a function to update the discriminator. 


#@save
def update_D(X, Z, net_D, net_G, loss, trainer_D):
    """Update discriminator."""
    batch_size = X.shape[0]
    ones = np.ones((batch_size,), ctx=X.ctx)
    zeros = np.zeros((batch_size,), ctx=X.ctx)
    with autograd.record():
        real_Y = net_D(X)
        fake_X = net_G(Z)
        # Do not need to compute gradient for `net_G`, detach it from
        # computing gradients.
        fake_Y = net_D(fake_X.detach())
        loss_D = (loss(real_Y, ones) + loss(fake_Y, zeros)) / 2
    loss_D.backward()
    trainer_D.step(batch_size)
    return float(loss_D.sum())


The generator is updated similarly. Here we reuse the cross-entropy loss but change the label of 
the fake data from 0 to 1. 


#@save
def update_G(Z, net_D, net_G, loss, trainer_G):
    """Update generator."""
    batch_size = Z.shape[0]
    ones = np.ones((batch_size,), ctx=Z.ctx)
    with autograd.record():
        # We could reuse `fake_X` from `update_D` to save computation
        fake_X = net_G(Z)
        # Recomputing `fake_Y` is needed since `net_D` is changed
        fake_Y = net_D(fake_X)
        loss_G = loss(fake_Y, ones)
    loss_G.backward()
    trainer_G.step(batch_size)
    return float(loss_G.sum())


Both the discriminator and the generator perform a binary logistic regression with the 
cross-entropy loss. We use Adam to smooth the training process. In each iteration, we first update 
the discriminator and then the generator. We visualize both losses and generated examples. 


def train(net_D, net_G, data_iter, num_epochs, lr_D, lr_G, latent_dim, data):
    loss = gluon.loss.SigmoidBCELoss()
    net_D.initialize(init=init.Normal(0.02), force_reinit=True)
    net_G.initialize(init=init.Normal(0.02), force_reinit=True)
    trainer_D = gluon.Trainer(net_D.collect_params(),
                              'adam', {'learning_rate': lr_D})
    trainer_G = gluon.Trainer(net_G.collect_params(),
                              'adam', {'learning_rate': lr_G})
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs], nrows=2, figsize=(5, 5),
                            legend=['discriminator', 'generator'])
    animator.fig.subplots_adjust(hspace=0.3)
    for epoch in range(num_epochs):
        # Train one epoch
        timer = d2l.Timer()
        metric = d2l.Accumulator(3)  # loss_D, loss_G, num_examples
        for X in data_iter:
            batch_size = X.shape[0]
            Z = np.random.normal(0, 1, size=(batch_size, latent_dim))
            metric.add(update_D(X, Z, net_D, net_G, loss, trainer_D),
                       update_G(Z, net_D, net_G, loss, trainer_G),
                       batch_size)
        # Visualize generated examples
        Z = np.random.normal(0, 1, size=(100, latent_dim))
        fake_X = net_G(Z).asnumpy()
        animator.axes[1].cla()
        animator.axes[1].scatter(data[:, 0], data[:, 1])
        animator.axes[1].scatter(fake_X[:, 0], fake_X[:, 1])
        animator.axes[1].legend(['real', 'generated'])
        # Show the losses
        loss_D, loss_G = metric[0]/metric[2], metric[1]/metric[2]
        animator.add(epoch + 1, (loss_D, loss_G))
    print(f'loss_D {loss_D:.3f}, loss_G {loss_G:.3f}, '
          f'{metric[2] / timer.stop():.1f} examples/sec')


Now we specify the hyperparameters to fit the Gaussian distribution. 


lr_D, lr_G, latent_dim, num_epochs = 0.05, 0.005, 2, 20
train(net_D, net_G, data_iter, num_epochs, lr_D, lr_G, latent_dim,
      data[:100].asnumpy())





loss_D 0.693, loss_G 0.693, 570.3 examples/sec 

Summary 


• Generative adversarial networks (GANs) are composed of two deep networks, the generator and 
the discriminator. 


• The generator generates images that are as close to the true images as possible to fool the 
discriminator, by maximizing the cross-entropy loss, i.e., $\max \log(D(\mathbf{x}'))$. 


• The discriminator tries to distinguish the generated images from the true images, by 
minimizing the cross-entropy loss, i.e., $\min - y \log D(\mathbf{x}) - (1-y)\log(1-D(\mathbf{x}))$. 


Exercises 


• Does an equilibrium exist where the generator wins, i.e., the discriminator ends up unable 
to distinguish the two distributions on finite samples? 


Discussions [238] 





238 https://discuss.d2l.ai/t/408 

17.2 Deep Convolutional Generative Adversarial Networks 


In Section 17.1, we introduced the basic ideas behind how GANs work. We showed that they can 
draw samples from some simple, easy-to-sample distribution, like a uniform or normal 
distribution, and transform them into samples that appear to match the distribution of some dataset. 
And while our example of matching a 2D Gaussian distribution got the point across, it is not 
especially exciting. 


In this section, we will demonstrate how you can use GANs to generate photorealistic images. We 
will be basing our models on the deep convolutional GANs (DCGAN) introduced in (Radford et 
al., 2015). We will borrow the convolutional architecture that has proven so successful for 
discriminative computer vision problems and show how, via GANs, it can be leveraged to generate 
photorealistic images. 


from mxnet import gluon, init, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l

npx.set_np()


17.2.1 The Pokemon Dataset 


The dataset we will use is a collection of Pokemon sprites obtained from pokemondb [239]. First 
download, extract and load this dataset. 


#@save
d2l.DATA_HUB['pokemon'] = (d2l.DATA_URL + 'pokemon.zip',
                           'c065c0e2593b8b161a2d7873e42418bf6a21106c')

data_dir = d2l.download_extract('pokemon')
pokemon = gluon.data.vision.datasets.ImageFolderDataset(data_dir)
pokemon = gluon.data.vision.datasets.ImageFolderDataset(data_dir) 


Downloading ../data/pokemon.zip from http://d2l-data.s3-accelerate.amazonaws.com/pokemon.zip...


We resize each image into $64\times 64$. The ToTensor transformation will project the pixel value into 
$[0, 1]$, while our generator will use the tanh function to obtain outputs in $[-1, 1]$. Therefore we 
normalize the data with $0.5$ mean and $0.5$ standard deviation to match the value range. 
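The effect of this normalization, and of undoing it before display, can be checked on a few scalar pixel values. This is a minimal sketch with hypothetical helper names (`normalize`, `denormalize`), not the Gluon transform itself.

```python
def normalize(p, mean=0.5, std=0.5):
    """Map a pixel value in [0, 1] to [-1, 1]."""
    return (p - mean) / std


def denormalize(q):
    """Invert the mapping, taking values in [-1, 1] back to [0, 1]."""
    return q / 2 + 0.5


pixels = [0.0, 0.25, 0.5, 1.0]
scaled = [normalize(p) for p in pixels]
restored = [denormalize(q) for q in scaled]
print(scaled, restored)
```

The inverse mapping is the same `/2 + 0.5` operation that is applied to a batch before the images are shown.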


batch_size = 256
transformer = gluon.data.vision.transforms.Compose([
    gluon.data.vision.transforms.Resize(64),
    gluon.data.vision.transforms.ToTensor(),
    gluon.data.vision.transforms.Normalize(0.5, 0.5)
])
data_iter = gluon.data.DataLoader(
    pokemon.transform_first(transformer), batch_size=batch_size,
    shuffle=True, num_workers=d2l.get_dataloader_workers())


Let us visualize the first 20 images. 





239 https://pokemondb.net/sprites 

d2l.set_figsize((4, 4))
for X, y in data_iter:
    imgs = X[0:20,:,:,:].transpose(0, 2, 3, 1)/2+0.5
    d2l.show_images(imgs, num_rows=4, num_cols=5)
    break





17.2.2 The Generator 


The generator needs to map the noise variable $\mathbf z\in\mathbb R^d$, a length-$d$ vector, to an RGB 
image with width and height of $64\times 64$. In Section 13.11 we introduced the fully convolutional 
network that uses transposed convolution layers (refer to Section 13.10) to enlarge the input size. 
The basic block of the generator contains a transposed convolution layer followed by batch 
normalization and a ReLU activation. 


class G_block(nn.Block):
    def __init__(self, channels, kernel_size=4,
                 strides=2, padding=1, **kwargs):
        super(G_block, self).__init__(**kwargs)
        self.conv2d_trans = nn.Conv2DTranspose(
            channels, kernel_size, strides, padding, use_bias=False)
        self.batch_norm = nn.BatchNorm()
        self.activation = nn.Activation('relu')

    def forward(self, X):
        return self.activation(self.batch_norm(self.conv2d_trans(X)))


By default, the transposed convolution layer uses a $k_h = k_w = 4$ kernel, $s_h = s_w = 2$ strides, 
and $p_h = p_w = 1$ padding. With an input shape of $n_h \times n_w = 16 \times 16$, the generator 
block will double the input's width and height: 


[my kp — (na — 1) (kn — $n) — 2Ph] X [(nwkw — (nw — 1) (kw — Sw) — 2Pw] 
[(kn + 8n(mn — 1) — 2pp] x [kw + Sw(nw — 1) — 2Pw] 

[((4+2x (16-1)-2x 1] x [(4+ 2 x (16-1)-2x 1] 

32 x 32. 





1 1 
Np X Ny 


(17.2.1) 


x = np.zeros((2, 3, 16, 16)) 
g_blk = G_block(20) 
g_blk.initialize() 
g_blk(x).shape 


(2, 20, 32, 32)


If we change the transposed convolution layer to a $4\times 4$ kernel, $1\times 1$ strides, and zero padding,
then with an input size of $1 \times 1$ the output has its width and height each increased by 3.


x = np.zeros((2, 3, 1, 1)) 

g_blk = G_block(20, strides=1, padding=0) 
g_blk.initialize() 

g_blk(x).shape 


(2, 20, 4, 4) 
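Both shape checks follow from the transposed convolution size rule in (17.2.1). As a minimal sketch in plain Python (the helper name `transposed_conv_out` is our own and is not part of MXNet or d2l), the rule can be evaluated along one axis without building any layer:

```python
def transposed_conv_out(n, kernel=4, stride=2, padding=1):
    """Output length along one axis: k + s * (n - 1) - 2 * p, as in (17.2.1)."""
    return kernel + stride * (n - 1) - 2 * padding


# Default generator block settings: the spatial size doubles.
print(transposed_conv_out(16))
# Stride 1 and no padding: a 1 x 1 input grows by kernel - 1 = 3.
print(transposed_conv_out(1, stride=1, padding=0))
```

Because width and height use the same hyperparameters here, one call per axis is enough to predict the full output shape.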


The generator consists of four basic blocks that increase the input's width and height from 1
to 32. At the same time, it first projects the latent variable into $64\times 8$ channels, and then halves
the channels each time. Finally, a transposed convolution layer is used to generate the output. It
further doubles the width and height to match the desired $64\times 64$ shape, and reduces the channel
size to 3. The tanh activation function is applied to project output values into the $(-1, 1)$ range.


n_G = 64
net_G = nn.Sequential()
net_G.add(G_block(n_G*8, strides=1, padding=0),  # Output: (64 * 8, 4, 4)
          G_block(n_G*4),  # Output: (64 * 4, 8, 8)
          G_block(n_G*2),  # Output: (64 * 2, 16, 16)
          G_block(n_G),  # Output: (64, 32, 32)
          nn.Conv2DTranspose(
              3, kernel_size=4, strides=2, padding=1, use_bias=False,
              activation='tanh'))  # Output: (3, 64, 64)


Generate a 100-dimensional latent variable to verify the generator's output shape.


x = np.zeros((1, 100, 1, 1))
net_G.initialize()
net_G(x).shape


(1, 3, 64, 64)


17.2.3 Discriminator 


The discriminator is a normal convolutional network except that it uses a leaky ReLU as
its activation function. Given $\alpha \in [0, 1]$, its definition is

$$
\textrm{leaky ReLU}(x) = \begin{cases}x & \text{if}\ x > 0\\ \alpha x &\text{otherwise}\end{cases}.
\tag{17.2.2}
$$

As can be seen, it is the normal ReLU if $\alpha = 0$, and an identity function if $\alpha = 1$. For
$\alpha \in (0, 1)$, leaky ReLU is a nonlinear function that gives a non-zero output for a negative input.
It aims to fix the "dying ReLU" problem, in which a neuron's input might always be negative, so that it
cannot make any progress since the gradient of ReLU there is 0.
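To make the role of the slope concrete, here is a small scalar sketch of the definition in (17.2.2) and of its derivative. The function names are our own and the code is independent of the `nn.LeakyReLU` layer used in this section:

```python
def leaky_relu(x, alpha=0.2):
    """Return x for positive inputs and alpha * x otherwise, as in (17.2.2)."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.2):
    """Slope of leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha


# For a negative input, alpha = 0 (plain ReLU) gives zero output and zero
# gradient, while any alpha > 0 keeps a non-zero gradient.
for alpha in (0.0, 0.2, 1.0):
    print(alpha, leaky_relu(-2.0, alpha), leaky_relu_grad(-2.0, alpha))
```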


alphas = [0, .2, .4, .6, .8, 1]
x = np.arange(-2, 1, 0.1)
Y = [nn.LeakyReLU(alpha)(x).asnumpy() for alpha in alphas]
d2l.plot(x.asnumpy(), Y, 'x', 'y', alphas)







The basic block of the discriminator is a convolution layer followed by a batch normalization layer
and a leaky ReLU activation. The hyperparameters of the convolution layer are similar to those of the
transposed convolution layer in the generator block.


class D_block(nn.Block):
    def __init__(self, channels, kernel_size=4, strides=2,
                 padding=1, alpha=0.2, **kwargs):
        super(D_block, self).__init__(**kwargs)
        self.conv2d = nn.Conv2D(
            channels, kernel_size, strides, padding, use_bias=False)
        self.batch_norm = nn.BatchNorm()
        self.activation = nn.LeakyReLU(alpha)

    def forward(self, X):
        return self.activation(self.batch_norm(self.conv2d(X)))


A basic block with default settings will halve the width and height of the inputs, as we
demonstrated in Section 6.3. For example, given an input shape $n_h = n_w = 16$, with a kernel shape
$k_h = k_w = 4$, a stride shape $s_h = s_w = 2$, and a padding shape $p_h = p_w = 1$, the output shape will
be:

$$
\begin{aligned}
n_h' \times n_w' &= \lfloor(n_h-k_h+2p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+2p_w+s_w)/s_w\rfloor\\
  &= \lfloor(16-4+2\times 1+2)/2\rfloor \times \lfloor(16-4+2\times 1+2)/2\rfloor\\
  &= 8 \times 8.
\end{aligned}
\tag{17.2.3}
$$


x = np.zeros((2, 3, 16, 16)) 
d_blk = D_block(20) 
d_blk.initialize() 

d_blk(x).shape


(2, 20, 8, 8)


The discriminator is a mirror of the generator. 


n_D = 64
net_D = nn.Sequential()
net_D.add(D_block(n_D),  # Output: (64, 32, 32)
          D_block(n_D*2),  # Output: (64 * 2, 16, 16)
          D_block(n_D*4),  # Output: (64 * 4, 8, 8)
          D_block(n_D*8),  # Output: (64 * 8, 4, 4)
          nn.Conv2D(1, kernel_size=4, use_bias=False))  # Output: (1, 1, 1)


It uses a convolution layer with output channel 1 as the last layer to obtain a single prediction 
value. 


x = np.zeros((1, 3, 64, 64)) 
net_D.initialize() 
net_D(x).shape 


(1, 1, 1, 1) 
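The spatial sizes quoted in the comments of the discriminator can be traced with the convolution size rule in (17.2.3). The sketch below is plain Python; the helper `conv_out` is our own name, and the final layer is assumed to use a stride of 1 and no padding, since no other values are passed to it above:

```python
def conv_out(n, kernel=4, stride=2, padding=1):
    """Output length along one axis: floor((n - k + 2p + s) / s), as in (17.2.3)."""
    return (n - kernel + 2 * padding + stride) // stride


sizes = [64]
for _ in range(4):  # four blocks with the default settings
    sizes.append(conv_out(sizes[-1]))
print(sizes)  # spatial size before and after each block

# Final 4 x 4 convolution (assumed stride 1, padding 0).
print(conv_out(sizes[-1], kernel=4, stride=1, padding=0))
```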


17.2.4 Training 


Compared to the basic GAN in Section 17.1, we use the same learning rate for both the generator and
the discriminator since they are similar to each other. In addition, we change $\beta_1$ in Adam (Section
11.10) from 0.9 to 0.5. This decreases the smoothness of the momentum, the exponentially weighted
moving average of past gradients, to cope with the rapidly changing gradients that arise because the
generator and the discriminator fight with each other. Besides, the randomly generated noise Z is a 4-D
tensor and we use a GPU to accelerate the computation.
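To see why a smaller first-moment coefficient reacts faster, the sketch below applies the exponentially weighted moving average $v_t = \beta v_{t-1} + (1-\beta) g_t$ to a toy gradient sequence whose sign flips. The helper `ema` and the gradient values are illustrative assumptions, and Adam's bias correction and second-moment scaling are deliberately left out:

```python
def ema(grads, beta):
    """Exponentially weighted moving average v_t = beta * v_{t-1} + (1 - beta) * g_t."""
    v, history = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g
        history.append(v)
    return history


# A made-up gradient sequence: ten positive steps, then three negative ones.
grads = [1.0] * 10 + [-1.0] * 3
slow = ema(grads, beta=0.9)
fast = ema(grads, beta=0.5)
print(f'beta=0.9: {slow[-1]:.3f}, beta=0.5: {fast[-1]:.3f}')
```

With the larger coefficient the average still points in the old direction after the flip, whereas the smaller one has already followed the new gradients.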


def train(net_D, net_G, data_iter, num_epochs, lr, latent_dim,
          device=d2l.try_gpu()):
    loss = gluon.loss.SigmoidBCELoss()
    net_D.initialize(init=init.Normal(0.02), force_reinit=True, ctx=device)
    net_G.initialize(init=init.Normal(0.02), force_reinit=True, ctx=device)
    trainer_hp = {'learning_rate': lr, 'beta1': 0.5}
    trainer_D = gluon.Trainer(net_D.collect_params(), 'adam', trainer_hp)
    trainer_G = gluon.Trainer(net_G.collect_params(), 'adam', trainer_hp)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs], nrows=2, figsize=(5, 5),
                            legend=['discriminator', 'generator'])
    animator.fig.subplots_adjust(hspace=0.3)
    for epoch in range(1, num_epochs + 1):
        # Train one epoch
        timer = d2l.Timer()
        metric = d2l.Accumulator(3)  # loss_D, loss_G, num_examples
        for X, _ in data_iter:
            batch_size = X.shape[0]
            Z = np.random.normal(0, 1, size=(batch_size, latent_dim, 1, 1))
            X, Z = X.as_in_ctx(device), Z.as_in_ctx(device)
            metric.add(d2l.update_D(X, Z, net_D, net_G, loss, trainer_D),
                       d2l.update_G(Z, net_D, net_G, loss, trainer_G),
                       batch_size)
        # Show generated examples
        Z = np.random.normal(0, 1, size=(21, latent_dim, 1, 1), ctx=device)
        # Rescale the synthetic data from [-1, 1] to [0, 1] for display
        fake_x = net_G(Z).transpose(0, 2, 3, 1) / 2 + 0.5
        imgs = np.concatenate(
            [np.concatenate([fake_x[i * 7 + j] for j in range(7)], axis=1)
             for i in range(len(fake_x)//7)], axis=0)
        animator.axes[1].cla()
        animator.axes[1].imshow(imgs.asnumpy())
        # Show the losses
        loss_D, loss_G = metric[0] / metric[2], metric[1] / metric[2]
        animator.add(epoch, (loss_D, loss_G))
    print(f'loss_D {loss_D:.3f}, loss_G {loss_G:.3f}, '
          f'{metric[2] / timer.stop():.1f} examples/sec on {str(device)}')


We train the model with a small number of epochs just for demonstration. For better perfor- 
mance, the variable num_epochs can be set to a larger number. 


latent_dim, lr, num_epochs = 100, 0.005, 20 
train(net_D, net_G, data_iter, num_epochs, lr, latent_dim) 





loss_D 0.272, loss_G 6.900, 2524.9 examples/sec on gpu(0)


Summary


* The DCGAN architecture has four convolutional layers for the discriminator and four
  "fractionally-strided" convolutional layers for the generator.

* The discriminator is a stack of four strided convolutional layers with batch normalization (except its
  input layer) and leaky ReLU activations.

* Leaky ReLU is a nonlinear function that gives a non-zero output for a negative input. It aims
  to fix the "dying ReLU" problem and helps the gradients flow more easily through the architecture.


Exercises 


1. What will happen if we use standard ReLU activation rather than leaky ReLU? 
2. Apply DCGAN on Fashion-MNIST and see which category works well and which does not. 


Discussions: https://discuss.d2l.ai/t/409


18 Appendix: Mathematics for Deep Learning


Brent Werness (Amazon), Rachel Hu (Amazon), and authors of this book 


One of the wonderful parts of modern deep learning is the fact that much of it can be understood 
and used without a full understanding of the mathematics below it. This is a sign that the field 
is maturing. Just as most software developers no longer need to worry about the theory of com- 
putable functions, neither should deep learning practitioners need to worry about the theoretical 
foundations of maximum likelihood learning. 


But, we are not quite there yet. 


In practice, you will sometimes need to understand how architectural choices influence gradient 
flow, or the implicit assumptions you make by training with a certain loss function. You might 
need to know what in the world entropy measures, and how it can help you understand exactly 
what bits-per-character means in your model. These all require deeper mathematical understand- 
ing. 


This appendix aims to provide you the mathematical background you need to understand the core 
theory of modern deep learning, but it is not exhaustive. We will begin with examining linear al- 
gebra in greater depth. We develop a geometric understanding of all the common linear algebraic 
objects and operations that will enable us to visualize the effects of various transformations on our 
data. A key element is the development of the basics of eigen-decompositions. 


We next develop the theory of differential calculus to the point that we can fully understand why
the negative gradient is the direction of steepest descent, and why back-propagation takes the form it does.
Integral calculus is then discussed to the degree needed to support our next topic, probability 
theory. 


Problems encountered in practice frequently are not certain, and thus we need a language to speak 
about uncertain things. We review the theory of random variables and the most commonly en- 
countered distributions so we may discuss models probabilistically. This provides the foundation 
for the naive Bayes classifier, a probabilistic classification technique. 


Closely related to probability theory is the study of statistics. While statistics is far too large a field 
to do justice in a short section, we will introduce fundamental concepts that all machine learning 
practitioners should be aware of, in particular: evaluating and comparing estimators, conducting 
hypothesis tests, and constructing confidence intervals. 


Last, we turn to the topic of information theory, which is the mathematical study of information 
storage and transmission. This provides the core language by which we may discuss quantitatively 
how much information a model holds on a domain of discourse.


Taken together, these form the core of the mathematical concepts needed to begin down the path
towards a deep understanding of deep learning.


18.1 Geometry and Linear Algebraic Operations 


In Section 2.3, we encountered the basics of linear algebra and saw how it could be used to express 
common operations for transforming our data. Linear algebra is one of the key mathematical 
pillars underlying much of the work that we do in deep learning and in machine learning more 
broadly. While Section 2.3 contained enough machinery to communicate the mechanics of mod- 
ern deep learning models, there is a lot more to the subject. In this section, we will go deeper, 
highlighting some geometric interpretations of linear algebra operations, and introducing a few 
fundamental concepts, including eigenvalues and eigenvectors.


18.1.1 Geometry of Vectors 


First, we need to discuss the two common geometric interpretations of vectors, as either points 
or directions in space. Fundamentally, a vector is a list of numbers such as the Python list below. 


v = [1, 7, 0, 1]


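Mathematicians most often write this as either a column or row vector, which is to say either as

$$
\mathbf{x} = \begin{bmatrix}1\\7\\0\\1\end{bmatrix},
\tag{18.1.1}
$$

or

$$
\mathbf{x}^\top = \begin{bmatrix}1 & 7 & 0 & 1\end{bmatrix}.
\tag{18.1.2}
$$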


These often have different interpretations, where data examples are column vectors and weights 
used to form weighted sums are row vectors. However, it can be beneficial to be flexible. As we 
have described in Section 2.3, though a single vector's default orientation is a column vector, for 
any matrix representing a tabular dataset, treating each data example as a row vector in the matrix 
is more conventional. 


Given a vector, the first interpretation that we should give it is as a point in space. In two or three
dimensions, we can visualize these points by using the components of the vectors to define the 
location of the points in space compared to a fixed reference called the origin. This can be seen 
in Fig. 18.1.1.


Fig. 18.1.1: An illustration of visualizing vectors as points in the plane. The first component of the
vector gives the x-coordinate, the second component gives the y-coordinate. Higher dimensions 
are analogous, although much harder to visualize. 


This geometric point of view allows us to consider the problem on a more abstract level. No longer 
faced with some insurmountable seeming problem like classifying pictures as either cats or dogs, 
we can start considering tasks abstractly as collections of points in space and picturing the task as 
discovering how to separate two distinct clusters of points. 


In parallel, there is a second point of view that people often take of vectors: as directions in space.
Not only can we think of the vector $\mathbf{v} = [3,2]^\top$ as the location 3 units to the right and 2 units up
from the origin, we can also think of it as the direction itself to take 3 steps to the right and 2 steps
up. In this way, we consider all the vectors in Fig. 18.1.2 the same.





Fig. 18.1.2: Any vector can be visualized as an arrow in the plane. In this case, every vector drawn
is a representation of the vector $(3,2)^\top$.


One of the benefits of this shift is that we can make visual sense of the act of vector addition. In 
particular, we follow the directions given by one vector, and then follow the directions given by 
the other, as is seen in Fig. 18.1.3.


Fig. 18.1.3: We can visualize vector addition by first following one vector, and then another.


Vector subtraction has a similar interpretation. By considering the identity
$\mathbf{u} = \mathbf{v} + (\mathbf{u}-\mathbf{v})$, we see that the vector $\mathbf{u}-\mathbf{v}$ is the
direction that takes us from the point $\mathbf{v}$ to the point $\mathbf{u}$.
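The "follow one direction, then the other" picture is easy to check numerically. This is a minimal sketch with plain Python lists; the helpers `add` and `sub` and the two example points are our own choices:

```python
def add(u, v):
    """Componentwise sum: follow u, then follow v."""
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    """Componentwise difference: the direction from the point v to the point u."""
    return [a - b for a, b in zip(u, v)]


u, v = [3, 2], [1, 4]     # example points chosen for illustration
step = sub(u, v)          # direction that takes us from v to u
print(step)
print(add(v, step) == u)  # starting at v and following u - v lands on u
```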


18.1.2 Dot Products and Angles 


As we saw in Section 2.3, if we take two column vectors u and v, we can form their dot product by 
computing: 


$$
\mathbf{u}^\top\mathbf{v} = \sum_i u_i\cdot v_i.
\tag{18.1.3}
$$


Because (18.1.3) is symmetric, we will mirror the notation of classical multiplication and write

$$
\mathbf{u}\cdot\mathbf{v} = \mathbf{u}^\top\mathbf{v} = \mathbf{v}^\top\mathbf{u},
\tag{18.1.4}
$$

to highlight the fact that exchanging the order of the vectors will yield the same answer.


The dot product (18.1.3) also admits a geometric interpretation: it is closely related to the angle 
between two vectors. Consider the angle shown in Fig. 18.1.4. 





Fig. 18.1.4: Between any two vectors in the plane there is a well defined angle $\theta$. We will see this
angle is intimately tied to the dot product.


To start, let us consider two specific vectors: 


$$
\mathbf{v} = (r,0) \; \textrm{and} \; \mathbf{w} = (s\cos(\theta), s \sin(\theta)).
\tag{18.1.5}
$$

The vector $\mathbf{v}$ has length $r$ and runs parallel to the $x$-axis, and the vector $\mathbf{w}$ has length $s$ and is at angle
$\theta$ with the $x$-axis. If we compute the dot product of these two vectors, we see that

$$
\mathbf{v}\cdot\mathbf{w} = rs\cos(\theta) = \|\mathbf{v}\|\|\mathbf{w}\|\cos(\theta).
\tag{18.1.6}
$$

With some simple algebraic manipulation, we can rearrange terms to obtain

$$
\theta = \arccos\left(\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}\right).
\tag{18.1.7}
$$


In short, for these two specific vectors, the dot product combined with the norms tells us the angle
between the two vectors. This same fact is true in general. We will not derive the expression here;
however, if we consider writing $\|\mathbf{v} - \mathbf{w}\|^2$ in two ways, one with the dot product and the other
geometrically using the law of cosines, we can obtain the full relationship. Indeed, for any two
vectors $\mathbf{v}$ and $\mathbf{w}$, the angle between the two vectors is

$$
\theta = \arccos\left(\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}\right).
\tag{18.1.8}
$$

This is a nice result since nothing in the computation references two dimensions. Indeed, we can
use this in three or three million dimensions without issue.
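The two ways of writing the squared distance can be compared numerically. The sketch below is a sanity check rather than a derivation: it expands the squared norm of the difference with dot products, and separately evaluates the law of cosines using the angle from (18.1.8). The helper names and the example vectors are our own choices:

```python
import math


def dot(v, w):
    return sum(a * b for a, b in zip(v, w))


def norm(v):
    return math.sqrt(dot(v, v))


v, w = [0.0, 1.0, 2.0], [2.0, 3.0, 4.0]
diff = [a - b for a, b in zip(v, w)]

# Squared distance computed directly, and via the dot product expansion
# ||v - w||^2 = ||v||^2 + ||w||^2 - 2 v.w
direct = dot(diff, diff)
via_dot = dot(v, v) + dot(w, w) - 2 * dot(v, w)

# Law of cosines, with the angle taken from (18.1.8)
theta = math.acos(dot(v, w) / (norm(v) * norm(w)))
via_cosines = norm(v) ** 2 + norm(w) ** 2 - 2 * norm(v) * norm(w) * math.cos(theta)

print(direct, via_dot, via_cosines)
```

Matching the last terms of the two expansions is exactly the statement that the dot product equals the product of the norms times the cosine of the angle.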


As a simple example, let us see how to compute the angle between a pair of vectors: 


%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mxnet import gluon, np, npx
npx.set_np()


def angle(v, w):
    return np.arccos(v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w)))


angle(np.array([0, 1, 2]), np.array([2, 3, 4])) 


array(0.41899002) 


We will not use it right now, but it is useful to know that we will refer to vectors for which the angle
is $\pi/2$ (or equivalently $90^{\circ}$) as being orthogonal. By examining the equation above, we see that this
happens when $\theta = \pi/2$, which is the same thing as $\cos(\theta) = 0$. The only way this can happen is
if the dot product itself is zero, and two vectors are orthogonal if and only if $\mathbf{v}\cdot\mathbf{w} = 0$. This will
prove to be a helpful formula when understanding objects geometrically.


It is reasonable to ask: why is computing the angle useful? The answer comes in the kind of 
invariance we expect data to have. Consider an image, and a duplicate image, where every pixel 
value is the same but 10% the brightness. The values of the individual pixels are in general far from 
the original values. Thus, if one computed the distance between the original image and the darker 
one, the distance can be large. However, for most ML applications, the content is the same—it is 
still an image of a cat as far as a cat/dog classifier is concerned. However, if we consider the angle, 
it is not hard to see that for any vector $\mathbf{v}$, the angle between $\mathbf{v}$ and $0.1\cdot\mathbf{v}$ is zero. This corresponds
to the fact that scaling vectors keeps the same direction and just changes the length. The angle 
considers the darker image identical. 
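The brightness example can be reproduced with a handful of numbers. In this sketch the "image" is a made-up list of four pixel values, and the helper functions are our own, written with the standard library only:

```python
import math


def dot(v, w):
    return sum(a * b for a, b in zip(v, w))


def angle(v, w):
    cos = dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error


def distance(v, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))


image = [200.0, 50.0, 120.0, 255.0]  # toy pixel values, made up
darker = [0.1 * p for p in image]    # the same image at 10% brightness

print(distance(image, darker))  # large: the raw values differ a lot
print(angle(image, darker))     # essentially zero: same direction
```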


Examples like this are everywhere. In text, we might want the topic being discussed to not change
if we write a document twice as long that says the same thing. For some encodings (such as
counting the number of occurrences of words in some vocabulary), this corresponds to a doubling of
the vector encoding the document, so again we can use the angle.


Cosine Similarity


In ML contexts where the angle is employed to measure the closeness of two vectors, practitioners
adopt the term cosine similarity to refer to the quantity

$$
\cos(\theta) = \frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}.
\tag{18.1.9}
$$

The cosine takes a maximum value of 1 when the two vectors point in the same direction, a min- 
imum value of —1 when they point in opposite directions, and a value of 0 when the two vectors 
are orthogonal. Note that if the components of high-dimensional vectors are sampled randomly 
with mean 0, their cosine will nearly always be close to 0. 
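These three special cases, and the remark about random high-dimensional vectors, can be checked with a short standard-library sketch. The function `cosine_similarity`, the dimension 10000, and the random seed are our own illustrative choices:

```python
import math
import random


def cosine_similarity(v, w):
    """cos(theta) = v.w / (||v|| ||w||), as in (18.1.9)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)


v = [1.0, 2.0, 3.0]
print(cosine_similarity(v, [2.0, 4.0, 6.0]))      # same direction
print(cosine_similarity(v, [-1.0, -2.0, -3.0]))   # opposite direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal

# Two random high-dimensional vectors with zero-mean components.
random.seed(0)
a = [random.gauss(0, 1) for _ in range(10000)]
b = [random.gauss(0, 1) for _ in range(10000)]
print(cosine_similarity(a, b))
```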


18.1.3 Hyperplanes 


In addition to working with vectors, another key object that you must understand to go far in linear 
algebra is the hyperplane, a generalization to higher dimensions of a line (two dimensions) or of a 
plane (three dimensions). In a $d$-dimensional vector space, a hyperplane has $d-1$ dimensions
and divides the space into two half-spaces.


Let us start with an example. Suppose that we have a column vector $\mathbf{w}=[2,1]^\top$. We want to know,
"what are the points $\mathbf{v}$ with $\mathbf{w}\cdot\mathbf{v} = 1$?" By recalling the connection between dot products and
angles above (18.1.8), we can see that this is equivalent to

$$
\|\mathbf{v}\|\|\mathbf{w}\|\cos(\theta) = 1 \; \iff \; \|\mathbf{v}\|\cos(\theta) = \frac{1}{\|\mathbf{w}\|} = \frac{1}{\sqrt{5}}.
\tag{18.1.10}
$$





Fig. 18.1.5: Recalling trigonometry, we see the formula $\|\mathbf{v}\|\cos(\theta)$ is the length of the projection
of the vector $\mathbf{v}$ onto the direction of $\mathbf{w}$.


If we consider the geometric meaning of this expression, we see that this is equivalent to saying
that the length of the projection of $\mathbf{v}$ onto the direction of $\mathbf{w}$ is exactly $1/\|\mathbf{w}\|$, as is shown in Fig.
18.1.5. The set of all points where this is true is a line at right angles to the vector $\mathbf{w}$. If we wanted,
we could find the equation for this line and see that it is $2x + y = 1$ or equivalently $y = 1 - 2x$.


If we now look at what happens when we ask about the set of points with $\mathbf{w}\cdot\mathbf{v} > 1$ or $\mathbf{w}\cdot\mathbf{v} < 1$, we
can see that these are cases where the projections are longer or shorter than $1/\|\mathbf{w}\|$, respectively.
Thus, those two inequalities define either side of the line. In this way, we have found a way to cut
our space into two halves, where all the points on one side have dot product below a threshold,
and the other side above as we see in Fig. 18.1.6.


[Figure labels: $\mathbf{v}\cdot\mathbf{w} < 1$, $\mathbf{v}\cdot\mathbf{w} = 1$, $\mathbf{v}\cdot\mathbf{w} > 1$]


Fig. 18.1.6: If we now consider the inequality version of the expression, we see that our hyperplane
(in this case: just a line) separates the space into two halves.
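Deciding which side of the line a point falls on only requires one dot product and a comparison. The following sketch uses the vector with components 2 and 1 and the threshold 1 from the example; the helper `side` and the test points are our own illustrative choices:

```python
def side(w, v, threshold=1.0):
    """Return +1, 0 or -1 when w.v is above, on, or below the threshold."""
    value = sum(a * b for a, b in zip(w, v))
    return (value > threshold) - (value < threshold)


w = [2, 1]
for v in ([0, 1], [1, 1], [0, 0], [0.5, 0]):  # points picked for illustration
    print(v, side(w, v))
```

Points that return 0 lie on the line $2x + y = 1$, and the sign of the result identifies the half-plane for all other points.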


The story in higher dimensions is much the same. If we now take $\mathbf{w} = [1,2,3]^\top$ and ask about the
points in three dimensions with $\mathbf{w}\cdot\mathbf{v} = 1$, we obtain a plane at right angles to the given vector $\mathbf{w}$.
The two inequalities again define the two sides of the plane as is shown in Fig. 18.1.7.


[Figure labels: $\mathbf{v}\cdot\mathbf{w} = 1$, $\mathbf{v}\cdot\mathbf{w} < 1$, $\mathbf{v}\cdot\mathbf{w} > 1$]


Fig. 18.1.7: Hyperplanes in any dimension separate the space into two halves.


While our ability to visualize runs out at this point, nothing stops us from doing this in tens, hun- 
dreds, or billions of dimensions. This occurs often when thinking about machine learned models. 
For instance, we can understand linear classification models like those from Section 3.4, as meth- 
ods to find hyperplanes that separate the different target classes. In this context, such hyperplanes 
are often referred to as decision planes. The majority of deep learned classification models end with 
a linear layer fed into a softmax, so one can interpret the role of the deep neural network to be to 
find a non-linear embedding such that the target classes can be separated cleanly by hyperplanes. 


To give a hand-built example, notice that we can produce a reasonable model to classify tiny
images of t-shirts and trousers from the Fashion MNIST dataset (seen in Section 3.5) by just taking
the vector between their means to define the decision plane and eyeballing a crude threshold. First
we will load the data and compute the averages.


# Load in the dataset
train = gluon.data.vision.FashionMNIST(train=True)
test = gluon.data.vision.FashionMNIST(train=False)

X_train_0 = np.stack([x[0] for x in train if x[1] == 0]).astype(float)
X_train_1 = np.stack([x[0] for x in train if x[1] == 1]).astype(float)
X_test = np.stack(
    [x[0] for x in test if x[1] == 0 or x[1] == 1]).astype(float)
y_test = np.stack(
    [x[1] for x in test if x[1] == 0 or x[1] == 1]).astype(float)

# Compute averages
ave_0 = np.mean(X_train_0, axis=0)
ave_1 = np.mean(X_train_1, axis=0)

It can be informative to examine these averages in detail, so let us plot what they look like. In this 
case, we see that the average indeed resembles a blurry image of a t-shirt. 


# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()





In the second case, we again see that the average resembles a blurry image of trousers. 


# Plot average trousers
d2l.plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()


In a fully machine learned solution, we would learn the threshold from the dataset. In this case,
we simply eyeballed a threshold that looked good on the training data.


# Print test set accuracy with eyeballed threshold 
w = (ave_1 - ave_0).T 
predictions = X_test.reshape(2000, -1).dot(w.flatten()) > -1500000 


# Accuracy 
np.mean(predictions.astype(y_test.dtype) == y_test, dtype=np.float64) 


array(0.801, dtype=float64) 


18.1.4 Geometry of Linear Transformations 


Through Section 2.3 and the above discussions, we have a solid understanding of the geometry of 
vectors, lengths, and angles. However, there is one important object we have omitted discussing, 
and that is a geometric understanding of linear transformations represented by matrices. Fully 
internalizing what matrices can do to transform data between two potentially different high di- 
mensional spaces takes significant practice, and is beyond the scope of this appendix. However, 
we can start building up intuition in two dimensions. 


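Suppose that we have some matrix:

$$
\mathbf{A} = \begin{bmatrix}
a & b \\ c & d
\end{bmatrix}.
\tag{18.1.11}
$$

If we want to apply this to an arbitrary vector $\mathbf{v} = [x, y]^\top$, we multiply and see that

$$
\begin{aligned}
\mathbf{A}\mathbf{v} & = \begin{bmatrix}a & b \\ c & d\end{bmatrix}\begin{bmatrix}x \\ y\end{bmatrix} \\
& = \begin{bmatrix}ax+by\\ cx+dy\end{bmatrix} \\
& = x\begin{bmatrix}a \\ c\end{bmatrix} + y\begin{bmatrix}b \\d\end{bmatrix} \\
& = x\left\{\mathbf{A}\begin{bmatrix}1 \\ 0\end{bmatrix}\right\} + y\left\{\mathbf{A}\begin{bmatrix}0 \\ 1\end{bmatrix}\right\}.
\end{aligned}
\tag{18.1.12}
$$

This may seem like an odd computation, where something clear became somewhat impenetrable.
However, it tells us that we can write the way that a matrix transforms any vector in terms of how
it transforms two specific vectors: $[1,0]^\top$ and $[0,1]^\top$. This is worth considering for a moment. We
have essentially reduced an infinite problem (what happens to any pair of real numbers) to a finite
one (what happens to these specific vectors). These vectors are an example of a basis, where we can
write any vector in our space as a weighted sum of these basis vectors.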


Let us draw what happens when we use the specific matrix

$$
\mathbf{A} = \begin{bmatrix}
1 & 2 \\
-1 & 3
\end{bmatrix}.
\tag{18.1.13}
$$


If we look at the specific vector $\mathbf{v} = [2, -1]^\top$, we see this is $2\cdot[1,0]^\top + (-1)\cdot[0,1]^\top$, and thus we
know that the matrix $\mathbf{A}$ will send this to $2(\mathbf{A}[1,0]^\top) + (-1)(\mathbf{A}[0,1]^\top) = 2[1, -1]^\top - [2,3]^\top = [0, -5]^\top$.
If we follow this logic through carefully, say by considering the grid of all integer pairs of points,
we see that what happens is that the matrix multiplication can skew, rotate, and scale the grid, but
the grid structure must remain as you see in Fig. 18.1.8.
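The same computation can be replayed in code. This sketch takes the matrix to have columns with entries (1, -1) and (2, 3), as the worked example implies, and uses a small helper `matvec` of our own rather than any library routine:

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


A = [[1, 2], [-1, 3]]
e1, e2 = [1, 0], [0, 1]
v = [2, -1]

# Images of the two basis vectors: these are the columns of A.
Ae1, Ae2 = matvec(A, e1), matvec(A, e2)
# Rebuild A v as the same weighted sum that expresses v in the basis.
combo = [v[0] * p + v[1] * q for p, q in zip(Ae1, Ae2)]

print(Ae1, Ae2)
print(combo, matvec(A, v))
```

Knowing where the two basis vectors land is therefore enough to know where every vector lands.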





Fig. 18.1.8: The matrix A acting on the given basis vectors. Notice how the entire grid is trans- 
ported along with it. 


This is the most important intuitive point to internalize about linear transformations represented 
by matrices. Matrices are incapable of distorting some parts of space differently than others. All 
they can do is take the original coordinates on our space and skew, rotate, and scale them. 


Some distortions can be severe. For instance, the matrix

$$
\mathbf{B} = \begin{bmatrix}
2 & -1 \\
4 & -2
\end{bmatrix},
\tag{18.1.14}
$$


compresses the entire two-dimensional plane down to a single line. Identifying and working with 
such transformations are the topic of a later section, but geometrically we can see that this is 
fundamentally different from the types of transformations we saw above. For instance, the result 
from matrix A can be “bent back” to the original grid. The results from matrix B cannot because 
we will never know where the vector $[1,2]^\top$ came from: was it $[1,1]^\top$ or $[0, -1]^\top$?


While this picture was for a 2 x 2 matrix, nothing prevents us from taking the lessons learned 
into higher dimensions. If we take similar basis vectors like [1, 0,...,0] and see where our matrix 
sends them, we can start to get a feeling for how the matrix multiplication distorts the entire space 
in whatever dimension space we are dealing with.


18.1.5 Linear Dependence


Consider again the matrix 


$$
\mathbf{B} = \begin{bmatrix}
2 & -1 \\
4 & -2
\end{bmatrix}.
\tag{18.1.15}
$$


This compresses the entire plane down to live on the single line $y = 2x$. The question now arises:
is there some way we can detect this just by looking at the matrix itself? The answer is that indeed
we can. Let $\mathbf{b}_1 = [2,4]^\top$ and $\mathbf{b}_2 = [-1, -2]^\top$ be the two columns of $\mathbf{B}$. Remember that we
can write everything transformed by the matrix $\mathbf{B}$ as a weighted sum of the columns of the matrix,
like $a_1\mathbf{b}_1 + a_2\mathbf{b}_2$. We call this a linear combination. The fact that $\mathbf{b}_1 = -2\cdot\mathbf{b}_2$ means that we can
write any linear combination of those two columns entirely in terms of, say, $\mathbf{b}_2$ since

$$
a_1\mathbf{b}_1 + a_2\mathbf{b}_2 = -2a_1\mathbf{b}_2 + a_2\mathbf{b}_2 = (a_2-2a_1)\mathbf{b}_2.
\tag{18.1.16}
$$


This means that one of the columns is, in a sense, redundant because it does not define a unique
direction in space. This should not surprise us too much since we already saw that this matrix
collapses the entire plane down into a single line. Moreover, we see that the linear dependence
$\mathbf{b}_1 = -2\cdot\mathbf{b}_2$ captures this. To make this more symmetrical between the two vectors, we will write
this as

$$
\mathbf{b}_1 + 2\cdot\mathbf{b}_2 = 0.
\tag{18.1.17}
$$

In general, we will say that a collection of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are linearly dependent if there exist
coefficients $a_1, \ldots, a_k$ not all equal to zero so that

$$
\sum_{i=1}^k a_i\mathbf{v}_i = 0.
\tag{18.1.18}
$$


In this case, we can solve for one of the vectors in terms of some combination of the others, and 
effectively render it redundant. Thus, a linear dependence in the columns of a matrix is a witness 
to the fact that our matrix is compressing the space down to some lower dimension. If there is 
no linear dependence we say the vectors are linearly independent. If the columns of a matrix are 
linearly independent, no compression occurs and the operation can be undone. 
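Both claims about the dependent columns can be verified directly: the columns satisfy the dependence relation, and every output lands on the line $y = 2x$, so different inputs can collide. The helper `matvec` and the choice of test inputs below are our own:

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]


B = [[2, -1], [4, -2]]
b1 = [row[0] for row in B]  # first column
b2 = [row[1] for row in B]  # second column

# The dependence b1 + 2 * b2 = 0 from (18.1.17).
print([p + 2 * q for p, q in zip(b1, b2)])

# Every image of B lies on the line y = 2x, and distinct inputs can collide.
for v in ([1, 1], [0, -1], [1, 0]):
    x, y = matvec(B, v)
    print(v, '->', [x, y], y == 2 * x)
```

Since two different inputs produce the same output, no function can recover the input from the output, which is the sense in which the operation cannot be undone.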


18.1.6 Rank 


If we have a general n x m matrix, it is reasonable to ask what dimension space the matrix maps 
into. A concept known as the rank will be our answer. In the previous section, we noted that a 
linear dependence bears witness to compression of space into a lower dimension and so we will 
be able to use this to define the notion of rank. In particular, the rank of a matrix A is the largest 
number of linearly independent columns amongst all subsets of columns. For example, the matrix 


$$
\mathbf{B} = \begin{bmatrix}
2 & 4 \\
-1 & -2
\end{bmatrix},
\tag{18.1.19}
$$


has rank(B) = 1, since the two columns are linearly dependent, but either column by itself is not 
linearly dependent. For a more challenging example, we can consider 


$$
\mathbf{C} = \begin{bmatrix}
1 & 3 & 0 & -1 & 0 \\
-1 & 0 & 1 & 1 & -1 \\
0 & 3 & 1 & 0 & -1 \\
2 & 3 & -1 & -2 & 1
\end{bmatrix},
\tag{18.1.20}
$$

and show that $\mathbf{C}$ has rank two since, for instance, the first two columns are linearly independent,
while every collection of three columns is dependent.


This procedure, as described, is very inefficient. It requires looking at every subset of the columns
of our given matrix, and thus is potentially exponential in the number of columns. Later we will 
see a more computationally efficient way to compute the rank of a matrix, but for now, this is 
sufficient to see that the concept is well defined and understand the meaning. 
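For readers who want to check the two examples numerically, one standard alternative to enumerating subsets is Gaussian elimination: the rank equals the number of pivots found. The sketch below is not the subset procedure described above and is not necessarily the method this appendix develops later; it uses exact fractions to avoid rounding, and the matrix entries are our reading of the examples for B and C:

```python
from fractions import Fraction


def rank(matrix):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    found = 0  # number of pivots found so far
    for col in range(n_cols):
        # Look for a row at or below position `found` with a non-zero entry.
        pivot = next(
            (r for r in range(found, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        # Eliminate this column from every row below the pivot row.
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
    return found


B = [[2, 4], [-1, -2]]
C = [[1, 3, 0, -1, 0],
     [-1, 0, 1, 1, -1],
     [0, 3, 1, 0, -1],
     [2, 3, -1, -2, 1]]
print(rank(B), rank(C))
```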


18.1.7 Invertibility 


We have seen above that multiplication by a matrix with linearly dependent columns cannot be
undone, i.e., there is no inverse operation that can always recover the input. However, multiplication
by a full-rank matrix (i.e., some $\mathbf{A}$ that is an $n \times n$ matrix with rank $n$) can always be
undone. Consider the matrix


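$$
\mathbf{I} = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix},
\tag{18.1.21}
$$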


which is the matrix with ones along the diagonal, and zeros elsewhere. We call this the identity 
matrix. It is the matrix which leaves our data unchanged when applied. To find a matrix which 
undoes what our matrix $\mathbf{A}$ has done, we want to find a matrix $\mathbf{A}^{-1}$ such that

$$
\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}.
\tag{18.1.22}
$$


If we look at this as a system, we have $n \times n$ unknowns (the entries of $\mathbf{A}^{-1}$) and $n \times n$ equations
(the equality that needs to hold between every entry of the product $\mathbf{A}^{-1}\mathbf{A}$ and every entry of $\mathbf{I}$) so
we should generically expect a solution to exist. Indeed, in the next section we will see a quantity
called the determinant, which has the property that as long as the determinant is not zero, we can
find a solution. We call such a matrix $\mathbf{A}^{-1}$ the inverse matrix. As an example, if $\mathbf{A}$ is the general
$2 \times 2$ matrix

$$
\mathbf{A} = \begin{bmatrix}
a & b \\
c & d
\end{bmatrix},
\tag{18.1.23}
$$

then we can see that the inverse is

$$
\frac{1}{ad-bc} \begin{bmatrix}
d & -b \\
-c & a
\end{bmatrix}.
\tag{18.1.24}
$$


We can check this by verifying that multiplying by the inverse given by the formula above yields
the identity matrix.


M = np.array([[1, 2], [1, 4]])
M_inv = np.array([[2, -1], [-0.5, 0.5]])
M_inv.dot(M)


array([[1., 0.],
       [0., 1.]])
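
The closed-form $2 \times 2$ inverse can also be wrapped in a small function. The sketch below is only an illustration: `inverse_2x2` is a hypothetical helper name, and it uses plain NumPy rather than any particular framework.

```python
import numpy as np


def inverse_2x2(mat):
    """Invert a 2 x 2 matrix with the closed-form formula (illustrative helper)."""
    (a, b), (c, d) = mat
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular: ad - bc = 0")
    return np.array([[d, -b], [-c, a]]) / det


M = np.array([[1.0, 2.0], [1.0, 4.0]])
M_inv = inverse_2x2(M)

print(M_inv)      # can be compared against np.linalg.inv(M)
print(M_inv @ M)  # should be the identity up to rounding
```

The explicit check for $ad - bc = 0$ anticipates the role the determinant plays later in this section.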
Numerical Issues


While the inverse of a matrix is useful in theory, we must say that most of the time we do not wish
to use the matrix inverse to solve a problem in practice. In general, there are far more numerically
stable algorithms for solving linear equations like


$$\mathbf{A}\mathbf{x} = \mathbf{b}, \tag{18.1.25}$$

than computing the inverse and multiplying to get

$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}. \tag{18.1.26}$$


Just as division by a small number can lead to numerical instability, so can inversion of a matrix 
which is close to having low rank. 


Moreover, it is common that the matrix $\mathbf{A}$ is sparse, which is to say that it contains only a small
number of non-zero values. If we were to explore examples, we would see that this does not mean
the inverse is sparse. Even if $\mathbf{A}$ was a 1 million by 1 million matrix with only 5 million non-zero
entries (and thus we need only store those 5 million), the inverse will typically have almost every
entry non-zero, requiring us to store all $(10^6)^2$ entries, that is, 1 trillion entries!


While we do not have time to dive all the way into the thorny numerical issues frequently encoun- 
tered when working with linear algebra, we want to provide you with some intuition about when 
to proceed with caution, and generally avoiding inversion in practice is a good rule of thumb. 
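
As a small illustration of this rule of thumb, the sketch below solves the same system twice in plain NumPy: once with `np.linalg.solve`, which factorizes the matrix, and once by forming the inverse explicitly. The random test system and the variable names are assumptions made for the example, and it makes no claim about which error is smaller on any particular run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative system with a known solution
A_sys = rng.standard_normal((50, 50))
x_true = rng.standard_normal(50)
b = A_sys @ x_true

x_solve = np.linalg.solve(A_sys, b)  # factorization, no explicit inverse
x_inv = np.linalg.inv(A_sys) @ b     # forms the full inverse first

print("error with solve:", np.linalg.norm(x_solve - x_true))
print("error with inv:  ", np.linalg.norm(x_inv - x_true))
```

On a small dense system like this one both routes usually work. The solver route matters most when the matrix is large, sparse, or close to having low rank.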


18.1.8 Determinant 


The geometric view of linear algebra gives an intuitive way to interpret a fundamental quantity 
known as the determinant. Consider the grid image from before, but now with a highlighted region 
(Fig. 18.1.9). 





Fig. 18.1.9: The matrix A again distorting the grid. This time, I want to draw particular attention 
to what happens to the highlighted square. 


Look at the highlighted square. This is a square with edges given by (0,1) and (1,0) and thus it 
has area one. After A transforms this square, we see that it becomes a parallelogram. There is 
no reason this parallelogram should have the same area that we started with, and indeed in the 
specific case shown here of 


$$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ -1 & 3 \end{bmatrix}, \tag{18.1.27}$$

it is an exercise in coordinate geometry to compute the area of this parallelogram and obtain that
the area is $5$.


In general, if we have a matrix 


$$\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \tag{18.1.28}$$


we can see with some computation that the area of the resulting parallelogram is $ad - bc$. This
area is referred to as the determinant.


Let us check this quickly with some example code. 


import numpy as np
np.linalg.det(np.array([[1, -1], [2, 3]]))


5.000000000000001 


The eagle-eyed amongst us will notice that this expression can be zero or even negative. For the 
negative term, this is a matter of convention taken generally in mathematics: if the matrix flips 
the figure, we say the area is negated. Let us see now that when the determinant is zero, we learn 
more. 


Let us consider 


$$\mathbf{B} = \begin{bmatrix} 2 & 4 \\ -1 & -2 \end{bmatrix}. \tag{18.1.29}$$


If we compute the determinant of this matrix, we get $2\cdot(-2) - 4\cdot(-1) = 0$. Given our understanding
above, this makes sense. $\mathbf{B}$ compresses the square from the original image down to a line segment,
which has zero area. And indeed, being compressed into a lower dimensional space is the only
way to have zero area after the transformation. Thus we see the following result is true: a matrix
$\mathbf{A}$ is invertible if and only if the determinant is not equal to zero.


As a final comment, imagine that we have any figure drawn on the plane. Thinking like computer
scientists, we can decompose that figure into a collection of little squares so that the area of the
figure is in essence just the number of squares in the decomposition. If we now transform that
figure by a matrix, we send each of these squares to parallelograms, each one of which has area
given by the determinant. We see that for any figure, the determinant gives the (signed) factor
by which a matrix scales the area of that figure.


Computing determinants for larger matrices can be laborious, but the intuition is the same. The
determinant remains the factor by which $n \times n$ matrices scale $n$-dimensional volumes.
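
The area interpretation can be checked numerically. The following sketch, in plain NumPy, maps the unit square through a matrix and measures the signed area of the image with the shoelace formula. The helper `signed_area` and both example matrices are illustrative choices, not part of the text; the first matrix is picked so that its determinant is 5, matching the area quoted above.

```python
import numpy as np


def signed_area(points):
    """Shoelace formula; positive when the vertices run counter-clockwise."""
    x, y = points[:, 0], points[:, 1]
    return 0.5 * np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y)


# Unit square, vertices listed counter-clockwise
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

A_ex = np.array([[1.0, 2.0], [-1.0, 3.0]])  # illustrative matrix with determinant 5
flip = np.array([[0.0, 1.0], [1.0, 0.0]])   # swaps the axes, so it flips orientation

for mat in (A_ex, flip):
    image = square @ mat.T  # apply the matrix to every vertex
    print(signed_area(image), np.linalg.det(mat))
```

For the flip the signed area is negative, which is the sign convention described earlier in this section.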


18.1.9 Tensors and Common Linear Algebra Operations 


In Section 2.3 the concept of tensors was introduced. In this section, we will dive more deeply into 
tensor contractions (the tensor equivalent of matrix multiplication), and see how it can provide a 
unified view on a number of matrix and vector operations. 


With matrices and vectors we knew how to multiply them to transform data. We need to have a 
similar definition for tensors if they are to be useful to us. Think about matrix multiplication: 


$$\mathbf{C} = \mathbf{A}\mathbf{B}, \tag{18.1.30}$$

or equivalently

$$c_{i,j} = \sum_{k} a_{i,k}b_{k,j}. \tag{18.1.31}$$


This pattern is one we can repeat for tensors. For tensors, there is no single choice of what to sum over
that can be universally made, so we need to specify exactly which indices we want to sum over. For
instance we could consider


$$y_{il} = \sum_{jk} x_{ijkl}a_{jk}. \tag{18.1.32}$$


Such a transformation is called a tensor contraction. It can represent a far more flexible family of
transformations than matrix multiplication alone.


As an often-used notational simplification, we can notice that the sum is over exactly those indices
that occur more than once in the expression, thus people often work with Einstein notation, where
the summation is implicitly taken over all repeated indices. This gives the compact expression:


$$y_{il} = x_{ijkl}a_{jk}. \tag{18.1.33}$$


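Common Examples from Linear Algebra

Let us see how many of the linear algebraic definitions we have seen before can be expressed in
this compressed tensor notation:

• $\mathbf{v} \cdot \mathbf{w} = \sum_i v_iw_i$

• $\|\mathbf{v}\|_2^{2} = \sum_i v_iv_i$

• $(\mathbf{A}\mathbf{v})_i = \sum_j a_{ij}v_j$

• $(\mathbf{A}\mathbf{B})_{ik} = \sum_j a_{ij}b_{jk}$

• $\mathrm{tr}(\mathbf{A}) = \sum_i a_{ii}$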


In this way, we can replace a myriad of specialized notations with short tensor expressions. 


Expressing in Code 


Tensors may flexibly be operated on in code as well. As seen in Section 2.3, we can create tensors 
as is shown below. 


# Define tensors
B = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = np.array([[1, 2], [3, 4]])
v = np.array([1, 2])

# Print out the shapes
A.shape, B.shape, v.shape


((2, 2), (2, 2, 3), (2,)) 
Einstein summation has been implemented directly. The indices that occur in the Einstein summation
can be passed as a string, followed by the tensors that are being acted upon. For instance,
to implement matrix multiplication, we can consider the Einstein summation seen above
($(\mathbf{A}\mathbf{v})_i = a_{ij}v_j$) and strip out the indices themselves to get the implementation:


# Reimplement matrix multiplication 
np.einsum("ij, j -> i", A, v), A.dot(v)


(array([ 5, 11]), array([ 5, 11])) 


This is a highly flexible notation. For instance if we want to compute what would be traditionally
written as

$$c_{kl} = \sum_{ij} b_{ijk}a_{il}v_j, \tag{18.1.34}$$

it can be implemented via Einstein summation as:


np.einsum("ijk, il, j -> kl", B, A, v)


array([[ 90, 126], 
[102, 144], 
[114, 162]]) 


This notation is readable and efficient for humans, but it becomes bulky if for whatever reason we need
to generate a tensor contraction programmatically. For this reason, einsum provides an alternative
notation in which integer indices are given for each tensor. For example, the same tensor contraction
can also be written as:


np.einsum(B, [0, 1, 2], A, [0, 3], v, [1], [2, 3])


array([[ 90, 126], 
[102, 144], 
[114, 162]]) 


Either notation allows for concise and efficient representation of tensor contractions in code. 
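
The list of common examples given earlier in this section translates almost symbol for symbol into `einsum` strings. The sketch below checks each one against the usual NumPy routine; the random inputs and the names `x`, `y`, `P`, and `Q` are placeholders chosen for the example so that they do not clash with the tensors defined above.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(4), rng.standard_normal(4)
P, Q = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))

# Each entry compares an Einstein-summation string with the standard routine
results = {
    "dot product": np.allclose(np.einsum("i,i->", x, y), x.dot(y)),
    "squared norm": np.allclose(np.einsum("i,i->", x, x), np.linalg.norm(x) ** 2),
    "matrix-vector": np.allclose(np.einsum("ij,j->i", P, x), P.dot(x)),
    "matrix-matrix": np.allclose(np.einsum("ij,jk->ik", P, Q), P.dot(Q)),
    "trace": np.allclose(np.einsum("ii->", P), np.trace(P)),
}

for name, ok in results.items():
    print(f"{name}: {ok}")
```

An index that appears on the left of `->` but not on the right is summed over, which is exactly the repeated-index convention of Einstein notation.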


Summary 


• Vectors can be interpreted geometrically as either points or directions in space.

• Dot products extend the notion of angle to arbitrarily high-dimensional spaces.

• Hyperplanes are high-dimensional generalizations of lines and planes. They can be used to
  define decision planes that are often used as the last step in a classification task.

• Matrix multiplication can be geometrically interpreted as a uniform distortion of the underlying
  coordinates. It represents a very restricted, but mathematically clean, way to transform
  vectors.

• Linear dependence is a way to tell when a collection of vectors lies in a lower dimensional
  space than we would expect (say you have 3 vectors living in a 2-dimensional space). The
  rank of a matrix is the size of the largest subset of its columns that are linearly independent.

• When a matrix's inverse is defined, matrix inversion allows us to find another matrix that
  undoes the action of the first. Matrix inversion is useful in theory, but requires care in practice
  owing to numerical instability.

• Determinants allow us to measure how much a matrix expands or contracts a space. A
  nonzero determinant implies an invertible (non-singular) matrix and a zero-valued
  determinant means that the matrix is non-invertible (singular).

• Tensor contractions and Einstein summation provide a neat and clean notation for
  expressing many of the computations that are seen in machine learning.


Exercises 


1. What is the angle between 


[o | H 
gal” w=]? (18.1.35) 


2. True or false: f 2 


1 -2 g 
? 
0 3 and E 1 | are inverses of one another? 


3. Suppose that we draw a shape in the plane with area 100m?. What is the area after trans- 
forming the figure by the matrix 


2 3 
i al (18.1.36) 


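4. Which of the following sets of vectors are linearly independent?

   • $\left\{(1, 0, -1)^\top, (2, 1, -1)^\top, (3, 1, 1)^\top\right\}$

   • $\left\{(3, 1, 1)^\top, (1, 1, 1)^\top, (0, 0, 0)^\top\right\}$

   • $\left\{(1, 1, 0)^\top, (0, 1, -1)^\top, (1, 0, 1)^\top\right\}$
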
5. Suppose that you have a matrix written as $A = \begin{bmatrix} c \\ d \end{bmatrix} \cdot \begin{bmatrix} a & b \end{bmatrix}$ for some choice of values $a, b, c$,
and $d$. True or false: the determinant of such a matrix is always $0$?


6. The vectors $e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ are orthogonal. What is the condition on a matrix $A$ so
that $Ae_1$ and $Ae_2$ are orthogonal?


7. How can you write $\mathrm{tr}(\mathbf{A}^4)$ in Einstein notation for an arbitrary matrix $\mathbf{A}$?


Discussions: https://discuss.d2l.ai/t/410
18.2 Eigendecompositions


Eigenvalues are among the most useful notions we will encounter when studying linear algebra.
As a beginner, however, it is easy to overlook their importance. Below, we introduce
eigendecomposition and try to convey some sense of just why it is so important.


Suppose that we have a matrix A with the following entries: 


$$\mathbf{A} = \begin{bmatrix} 2 & 0 \\ 0 & -1 \end{bmatrix}. \tag{18.2.1}$$


If we apply $\mathbf{A}$ to any vector $\mathbf{v} = [x, y]^\top$, we obtain a vector $\mathbf{A}\mathbf{v} = [2x, -y]^\top$. This has an intuitive
interpretation: stretch the vector to be twice as wide in the $x$-direction, and then flip it in the
$y$-direction.


However, there are some vectors for which something remains unchanged. Namely $[1, 0]^\top$ gets
sent to $[2, 0]^\top$ and $[0, 1]^\top$ gets sent to $[0, -1]^\top$. These vectors are still in the same line, and the only
modification is that the matrix stretches them by a factor of $2$ and $-1$ respectively. We call such
vectors eigenvectors and the factor they are stretched by eigenvalues.


In general, if we can find a number $\lambda$ and a vector $\mathbf{v}$ such that

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}, \tag{18.2.2}$$

we say that $\mathbf{v}$ is an eigenvector for $\mathbf{A}$ and $\lambda$ is an eigenvalue.


18.2.1 Finding Eigenvalues 


Let us figure out how to find them. By subtracting off the $\lambda \mathbf{v}$ from both sides, and then factoring
out the vector, we see the above is equivalent to:

$$(\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = 0. \tag{18.2.3}$$


For (18.2.3) to happen, we see that $(\mathbf{A} - \lambda \mathbf{I})$ must compress some direction down to zero, hence
it is not invertible, and thus the determinant is zero. Thus, we can find the eigenvalues by finding
for what $\lambda$ is $\det(\mathbf{A}-\lambda \mathbf{I}) = 0$. Once we find the eigenvalues, we can solve $\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$ to find the
associated eigenvector(s).


An Example 


Let us see this with a more challenging matrix 


$$\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 2 & 3 \end{bmatrix}. \tag{18.2.4}$$


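If we consider $\det(\mathbf{A}-\lambda \mathbf{I}) = 0$, we see this is equivalent to the polynomial equation
$0 = (2-\lambda)(3-\lambda)-2 = (4-\lambda)(1-\lambda)$. Thus, the two eigenvalues are $4$ and $1$. To find the associated vectors,
we then need to solve

$$\begin{bmatrix} 2 & 1 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 2 & 1 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 4x \\ 4y \end{bmatrix}. \tag{18.2.5}$$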


We can solve this with the vectors $[1, -1]^\top$ and $[1, 2]^\top$ respectively.


We can check this in code using the built-in numpy.linalg.eig routine.

%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
import numpy as np

np.linalg.eig(np.array([[2, 1], [2, 3]]))


(array([1., 4.]),
 array([[-0.70710678, -0.4472136 ],
        [ 0.70710678, -0.89442719]]))


Note that numpy normalizes the eigenvectors to be of length one, whereas we took ours to be of
arbitrary length. Additionally, the choice of sign is arbitrary. However, the vectors computed are
parallel to the ones we found by hand with the same eigenvalues.
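
The hand computation can also be confirmed directly. The sketch below, in plain NumPy, finds the roots of the characteristic polynomial $\lambda^2 - \mathrm{tr}(\mathbf{A})\lambda + \det(\mathbf{A})$ of the $2 \times 2$ example and then checks that each hand-derived vector satisfies $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$. The variable names are illustrative, and the trace and determinant form of the polynomial is specific to $2 \times 2$ matrices.

```python
import numpy as np

A_ex = np.array([[2.0, 1.0], [2.0, 3.0]])

# For a 2 x 2 matrix, det(A - lambda I) = lambda^2 - tr(A) lambda + det(A)
coeffs = [1.0, -np.trace(A_ex), np.linalg.det(A_ex)]
roots = np.sort(np.roots(coeffs))
print("roots of the characteristic polynomial:", roots)

# Eigenpairs found by hand in the text
hand_pairs = [(1.0, np.array([1.0, -1.0])), (4.0, np.array([1.0, 2.0]))]
for lam, vec in hand_pairs:
    print(lam, np.allclose(A_ex @ vec, lam * vec))
```

Any nonzero multiple of these vectors passes the same check, which is why the normalized vectors returned by numpy are equally valid.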


18.2.2 Decomposing Matrices 


Let us continue the previous example one step further. Let 


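$$\mathbf{W} = \begin{bmatrix} 1 & 1 \\ -1 & 2 \end{bmatrix} \tag{18.2.6}$$

be the matrix where the columns are the eigenvectors of the matrix $\mathbf{A}$. Let

$$\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix} \tag{18.2.7}$$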


be the matrix with the associated eigenvalues on the diagonal. Then the definition of eigenvalues 
and eigenvectors tells us that 


$$\mathbf{A}\mathbf{W} = \mathbf{W}\boldsymbol{\Sigma}. \tag{18.2.8}$$


The matrix $\mathbf{W}$ is invertible, so we may multiply both sides by $\mathbf{W}^{-1}$ on the right to see that we
may write

$$\mathbf{A} = \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^{-1}. \tag{18.2.9}$$


In the next section we will see some nice consequences of this, but for now we need only know 
that such a decomposition will exist as long as we can find a full collection of linearly independent 
eigenvectors (so that W is invertible). 
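
As a quick numerical check of this decomposition, the sketch below assembles the eigenvector matrix from the hand-computed eigenvectors $[1, -1]^\top$ and $[1, 2]^\top$, places the eigenvalues $1$ and $4$ on a diagonal, and multiplies the three factors back together. It uses plain NumPy, and the variable names are illustrative.

```python
import numpy as np

A_ex = np.array([[2.0, 1.0], [2.0, 3.0]])

# Columns are the eigenvectors found by hand; the diagonal holds their eigenvalues
W = np.array([[1.0, 1.0], [-1.0, 2.0]])
Sigma = np.diag([1.0, 4.0])

print(np.allclose(A_ex @ W, W @ Sigma))       # the defining relation A W = W Sigma
reconstructed = W @ Sigma @ np.linalg.inv(W)  # the decomposition itself
print(reconstructed)
```

The order of the columns of `W` must match the order of the eigenvalues on the diagonal; swapping one without the other breaks the identity.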


18.2.3 Operations on Eigendecompositions 


One nice thing about eigendecompositions (18.2.9) is that we can write many operations we usu- 
ally encounter cleanly in terms of the eigendecomposition. As a first example, consider: 


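$$\mathbf{A}^n = \overbrace{\mathbf{A}\cdots \mathbf{A}}^{\text{$n$ times}} = \overbrace{(\mathbf{W}\boldsymbol{\Sigma} \mathbf{W}^{-1})\cdots(\mathbf{W}\boldsymbol{\Sigma} \mathbf{W}^{-1})}^{\text{$n$ times}} = \mathbf{W}\overbrace{\boldsymbol{\Sigma}\cdots\boldsymbol{\Sigma}}^{\text{$n$ times}}\mathbf{W}^{-1} = \mathbf{W}\boldsymbol{\Sigma}^n \mathbf{W}^{-1}. \tag{18.2.10}$$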


This tells us that for any positive power of a matrix, the eigendecomposition is obtained by just 
raising the eigenvalues to the same power. The same can be shown for negative powers, so if we 
want to invert a matrix we need only consider 


$$\mathbf{A}^{-1} = \mathbf{W}\boldsymbol{\Sigma}^{-1} \mathbf{W}^{-1}, \tag{18.2.11}$$
or in other words, just invert each eigenvalue. This will work as long as each eigenvalue is non-zero,
so we see that being invertible is the same as having no zero eigenvalues.


Indeed, additional work can show that if $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of a matrix, then the
determinant of that matrix is

$$\det(\mathbf{A}) = \lambda_1 \cdots \lambda_n, \tag{18.2.12}$$


or the product of all the eigenvalues. This makes sense intuitively because whatever stretching
$\mathbf{W}$ does, $\mathbf{W}^{-1}$ undoes it, so in the end the only stretching that happens is by multiplication by the
diagonal matrix $\boldsymbol{\Sigma}$, which stretches volumes by the product of the diagonal elements.


Finally, recall that the rank was the maximum number of linearly independent columns of your
matrix. By examining the eigendecomposition closely, we can see that, when such a decomposition
exists, the rank is the same as the number of non-zero eigenvalues of $\mathbf{A}$.


The examples could continue, but hopefully the point is clear: eigendecomposition can simplify 
many linear-algebraic computations and is a fundamental operation underlying many numerical 
algorithms and much of the analysis that we do in linear algebra. 
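
The identities in this subsection are easy to test numerically. The sketch below, in plain NumPy, rebuilds a matrix power, the inverse, and the determinant from the eigendecomposition of the running $2 \times 2$ example, and then counts non-zero eigenvalues of a separate singular matrix. The second matrix, the tolerance `1e-10`, and the variable names are assumptions made for the illustration.

```python
import numpy as np

A_ex = np.array([[2.0, 1.0], [2.0, 3.0]])
eigvals, W = np.linalg.eig(A_ex)
W_inv = np.linalg.inv(W)

# Powers and inverses only touch the eigenvalues on the diagonal
A_cubed = W @ np.diag(eigvals ** 3) @ W_inv
A_inverse = W @ np.diag(1.0 / eigvals) @ W_inv
det_from_eigs = np.prod(eigvals)

print(np.allclose(A_cubed, np.linalg.matrix_power(A_ex, 3)))
print(np.allclose(A_inverse, np.linalg.inv(A_ex)))
print(det_from_eigs, np.linalg.det(A_ex))

# A singular but diagonalizable example: rank equals the number of non-zero eigenvalues
S_ex = np.array([[1.0, 2.0], [2.0, 4.0]])
num_nonzero = int(np.sum(np.abs(np.linalg.eigvals(S_ex)) > 1e-10))
print(num_nonzero, np.linalg.matrix_rank(S_ex))
```

The rank check relies on the matrix being diagonalizable, which is why a symmetric example is used for it here.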


18.2.4 Eigendecompositions of Symmetric Matrices 


It is not always possible to find enough linearly independent eigenvectors for the above process 
to work. For instance the matrix 


$$\mathbf{A} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \tag{18.2.13}$$


has only a single eigenvector, namely $(1, 0)^\top$. To handle such matrices, we require more advanced
techniques than we can cover (such as the Jordan Normal Form, or Singular Value Decomposition).
We will often need to restrict our attention to those matrices where we can guarantee the existence
of a full set of eigenvectors.


The most commonly encountered family are the symmetric matrices, which are those matrices
where $\mathbf{A} = \mathbf{A}^\top$. In this case, we may take $\mathbf{W}$ to be an orthogonal matrix (a matrix whose columns
are all length one vectors that are at right angles to one another, where $\mathbf{W}^\top = \mathbf{W}^{-1}$) and all the
eigenvalues will be real.


Thus, in this special case, we can write (18.2.9) as 


$$\mathbf{A} = \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^\top. \tag{18.2.14}$$
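
For symmetric matrices NumPy offers the dedicated routine `np.linalg.eigh`, which returns real eigenvalues in ascending order together with orthonormal eigenvectors. The sketch below checks both properties on a small symmetric matrix chosen purely for illustration.

```python
import numpy as np

S_sym = np.array([[2.0, 1.0], [1.0, 2.0]])  # illustrative symmetric matrix

eigvals, W = np.linalg.eigh(S_sym)

print(eigvals)                                         # real, in ascending order
print(np.allclose(W.T @ W, np.eye(2)))                 # columns are orthonormal
print(np.allclose(W @ np.diag(eigvals) @ W.T, S_sym))  # transpose replaces the inverse
```

Because the transpose of an orthogonal matrix is its inverse, no matrix inversion is needed to rebuild the original matrix.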


18.2.5 Gershgorin Circle Theorem 


Eigenvalues are often difficult to reason with intuitively. If presented an arbitrary matrix, there is
little that can be said about what the eigenvalues are without computing them. There is, however,
one theorem that makes it easy to approximate the eigenvalues well when the largest entries are on
the diagonal.


Let $\mathbf{A} = (a_{ij})$ be any square matrix ($n \times n$). We will define $r_i = \sum_{j \neq i} |a_{ij}|$. Let $\mathcal{D}_i$ represent the disc
in the complex plane with center $a_{ii}$ and radius $r_i$. Then, every eigenvalue of $\mathbf{A}$ is contained in one of
the $\mathcal{D}_i$.
This can be a bit to unpack, so let us look at an example.
Consider the matrix:

$$\mathbf{A} = \begin{bmatrix} 1.0 & 0.1 & 0.1 & 0.1 \\ 0.1 & 3.0 & 0.2 & 0.3 \\ 0.1 & 0.2 & 5.0 & 0.5 \\ 0.1 & 0.3 & 0.5 & 9.0 \end{bmatrix}. \tag{18.2.15}$$


We have $r_1 = 0.3$, $r_2 = 0.6$, $r_3 = 0.8$ and $r_4 = 0.9$. The matrix is symmetric, so all eigenvalues are
real. This means that all of our eigenvalues will be in one of the ranges of

$$[a_{11}-r_1, a_{11}+r_1] = [0.7, 1.3], \tag{18.2.16}$$

$$[a_{22}-r_2, a_{22}+r_2] = [2.4, 3.6], \tag{18.2.17}$$

$$[a_{33}-r_3, a_{33}+r_3] = [4.2, 5.8], \tag{18.2.18}$$

$$[a_{44}-r_4, a_{44}+r_4] = [8.1, 9.9]. \tag{18.2.19}$$





Performing the numerical computation shows that the eigenvalues are approximately 0.99, 2.97, 
4.95, 9.08, all comfortably inside the ranges provided. 


A = np.array([[1.0, 0.1, 0.1, 0.1],
              [0.1, 3.0, 0.2, 0.3],
              [0.1, 0.2, 5.0, 0.5],
              [0.1, 0.3, 0.5, 9.0]])

v, _ = np.linalg.eig(A)
v


array([9.08033648, 0.99228545, 4.95394089, 2.97343718]) 


In this way, eigenvalues can be approximated, and the approximations will be fairly accurate in 
the case that the diagonal is significantly larger than all the other elements. 


It is a small thing, but with a complex and subtle topic like eigendecomposition, it is good to get 
any intuitive grasp we can. 
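
The theorem is simple enough to turn into a few lines of code. The sketch below computes the centers and radii of the discs for the matrix in this example and checks that every computed eigenvalue lies in at least one disc. The function name `gershgorin_discs` and the variable `G` are hypothetical names chosen for this illustration, which uses plain NumPy.

```python
import numpy as np


def gershgorin_discs(mat):
    """Return the centers a_ii and radii r_i (off-diagonal absolute row sums)."""
    centers = np.diag(mat)
    radii = np.sum(np.abs(mat), axis=1) - np.abs(centers)
    return centers, radii


G = np.array([[1.0, 0.1, 0.1, 0.1],
              [0.1, 3.0, 0.2, 0.3],
              [0.1, 0.2, 5.0, 0.5],
              [0.1, 0.3, 0.5, 9.0]])

centers, radii = gershgorin_discs(G)
eigenvalues = np.linalg.eigvals(G)

# Each eigenvalue must fall within distance r_i of some center a_ii
inside = [bool(np.any(np.abs(lam - centers) <= radii)) for lam in eigenvalues]
print(radii)
print(inside)
```

The same check works for matrices with complex eigenvalues, since `np.abs` measures distance in the complex plane.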


18.2.6 A Useful Application: The Growth of Iterated Maps 


Now that we understand what eigenvectors are in principle, let us see how they can be used to 
provide a deep understanding of a problem central to neural network behavior: proper weight 
initialization. 
Eigenvectors as Long Term Behavior


The full mathematical investigation of the initialization of deep neural networks is beyond the 
scope of the text, but we can see a toy version here to understand how eigenvalues can help us see 
how these models work. As we know, neural networks operate by interspersing layers of linear 
transformations with non-linear operations. For simplicity here, we will assume that there is no 
non-linearity, and that the transformation is a single repeated matrix operation A, so that the 
output of our model is 


$$\mathbf{v}_{out} = \mathbf{A}\cdot \mathbf{A}\cdots \mathbf{A} \mathbf{v}_{in} = \mathbf{A}^N \mathbf{v}_{in}. \tag{18.2.20}$$


When these models are initialized, A is taken to be a random matrix with Gaussian entries, so let 
us make one of those. To be concrete, we start with a mean zero, variance one Gaussian distributed 
5 x 5 matrix. 


np.random.seed(8675309)

k = 5
A = np.random.randn(k, k)
A


array([[ 0.58902366,  0.73311856, -1.1621888 , -0.55681601, -0.77248843],
       [-0.16822143, -0.41650391, -1.37843129,  0.74925588,  0.17888446],
       [ 0.69401121, -1.9780535 , -0.83381434,  0.56437344,  0.31201299],
       [-0.87334496,  0.15601291, -0.38710108, -0.23920821,  0.88850104],
       [ 1.29385371, -0.76774106,  0.20131613,  0.91800842,  0.38974115]])


Behavior on Random Data 


For simplicity in our toy model, we will assume that the data vector we feed in, $\mathbf{v}_{in}$, is a random five
dimensional Gaussian vector. Let us think about what we want to have happen. For context, let us
think of a generic ML problem, where we are trying to turn input data, like an image, into a
prediction, like the probability the image is a picture of a cat. If repeated application of $\mathbf{A}$ stretches a
random vector out to be very long, then small changes in input will be amplified into large changes
in output: tiny modifications of the input image would lead to vastly different predictions. This
does not seem right!


On the flip side, if $\mathbf{A}$ shrinks random vectors to be shorter, then after running through many layers,
the vector will essentially shrink to nothing, and the output will not depend on the input. This is
clearly not right either!


We need to walk the narrow line between growth and decay to make sure that our output changes 
depending on our input, but not much! 


Let us see what happens when we repeatedly multiply our matrix A against a random input vector, 
and keep track of the norm. 


# Calculate the sequence of norms after repeatedly applying `A`
v_in = np.random.randn(k, 1)

norm_list = [np.linalg.norm(v_in)]
for i in range(1, 100):
    v_in = A.dot(v_in)
    norm_list.append(np.linalg.norm(v_in))

d2l.plot(np.arange(0, 100), norm_list, 'Iteration', 'Value')


[Plot: norm of the iterated vector (y-axis "Value", scale 1e29) against "Iteration".]


The norm is growing uncontrollably! Indeed, if we take the list of quotients, we will see a pattern.

# Compute the scaling factor of the norms
norm_ratio_list = []
for i in range(1, 100):
    norm_ratio_list.append(norm_list[i]/norm_list[i - 1])

d2l.plot(np.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio')


[Plot: ratio of consecutive norms (y-axis "Ratio", ticks from 1.8 to 2.2) against "Iteration".]


If we look at the last portion of the above computation, we see that the random vector is stretched 
by a factor of 1.974459321485[...], where the portion at the end shifts a little, but the stretching 
factor is stable. 







Relating Back to Eigenvectors 


We have seen that eigenvectors and eigenvalues correspond to the amount something is stretched, 
but that was for specific vectors, and specific stretches. Let us take a look at what they are for A. A 
bit of a caveat here: it turns out that to see them all, we will need to go to complex numbers. You 
can think of these as stretches and rotations. By taking the norm of the complex number (square 
root of the sums of squares of real and imaginary parts) we can measure that stretching factor. 
Let us also sort them. 


# Compute the eigenvalues
eigs = np.linalg.eigvals(A).tolist()
norm_eigs = [np.absolute(x) for x in eigs]
norm_eigs.sort()
print(f'norms of eigenvalues: {norm_eigs}')


norms of eigenvalues: [0.8786205280381857, 1.2757952665062624, 1.4983381517710659,
1.4983381517710659, 1.974459321485074]


An Observation 


We see something a bit unexpected happening here: that number we identified before for the long 
term stretching of our matrix A applied to a random vector is exactly (accurate to thirteen decimal 
places!) the largest eigenvalue of A. This is clearly not a coincidence! 


But, if we now think about what is happening geometrically, this starts to make sense. Consider a
random vector. This random vector points a little in every direction, so in particular, it points at
least a little bit in the same direction as the eigenvector of $\mathbf{A}$ associated with the largest eigenvalue.
This is so important that it is called the principal eigenvalue and principal eigenvector. After applying
$\mathbf{A}$, our random vector gets stretched in every possible direction, as is associated with every
possible eigenvector, but it is stretched most of all in the direction associated with this principal
eigenvector. What this means is that after applying $\mathbf{A}$, our random vector is longer, and points
in a direction closer to being aligned with the principal eigenvector. After applying the matrix
many times, the alignment with the principal eigenvector becomes closer and closer until, for
all practical purposes, our random vector has been transformed into the principal eigenvector!
Indeed this algorithm is the basis for what is known as the power iteration for finding the largest
eigenvalue and eigenvector of a matrix. For details see, for example, (VanLoan & Golub, 1983).
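
A minimal sketch of power iteration, under the assumption that the matrix has a single real eigenvalue of largest magnitude, might look as follows. The function name `power_iteration`, the step count, and the small test matrix (the $2 \times 2$ example from earlier in this section, with eigenvalues $1$ and $4$) are choices made for illustration; normalizing after every step keeps the vector from overflowing.

```python
import numpy as np


def power_iteration(mat, num_steps=200, seed=0):
    """Estimate the dominant eigenpair by repeated multiplication.

    Illustrative sketch: assumes one real eigenvalue strictly largest in magnitude.
    """
    rng = np.random.default_rng(seed)
    vec = rng.standard_normal(mat.shape[0])
    vec /= np.linalg.norm(vec)
    for _ in range(num_steps):
        vec = mat @ vec
        vec /= np.linalg.norm(vec)  # renormalize so the norm stays at one
    # For a unit eigenvector v, the product v^T (A v) recovers the eigenvalue
    return vec @ mat @ vec, vec


A_ex = np.array([[2.0, 1.0], [2.0, 3.0]])
lam, vec = power_iteration(A_ex)
print(lam, vec)
```

The error shrinks by roughly the ratio of the second-largest to the largest eigenvalue magnitude per step, so convergence is slow when the two are close.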


Fixing the Normalization 


Now, from the discussion above, we concluded that we do not want a random vector to be stretched
or squished at all; we would like random vectors to stay about the same size throughout the entire
process. To do so, we now rescale our matrix by this principal eigenvalue so that the largest
eigenvalue is instead now just one. Let us see what happens in this case.


# Rescale the matrix `A`
A /= norm_eigs[-1]

# Do the same experiment again
v_in = np.random.randn(k, 1)

norm_list = [np.linalg.norm(v_in)]
for i in range(1, 100):
    v_in = A.dot(v_in)
    norm_list.append(np.linalg.norm(v_in))

d2l.plot(np.arange(0, 100), norm_list, 'Iteration', 'Value')


[Plot: norm of the iterated vector after rescaling (y-axis "Value", ticks from 1.2 to 1.6) against "Iteration".]


We can also plot the ratio between consecutive norms as before and see that indeed it stabilizes. 


# Also plot the ratio
norm_ratio_list = []
for i in range(1, 100):
    norm_ratio_list.append(norm_list[i]/norm_list[i-1])

d2l.plot(np.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio')


[Plot: ratio of consecutive norms after rescaling (y-axis "Ratio", ticks from 0.80 to 1.00) against "Iteration".]
18.2.7 Conclusions


We now see exactly what we hoped for! After normalizing the matrices by the principal eigenvalue,
we see that the random data does not explode as before, but rather eventually equilibrates to a
specific value. It would be nice to be able to do these things from first principles, and it turns out
that if we look deeply at the mathematics of it, we can see that the largest eigenvalue of a large
random matrix with independent mean zero, variance one Gaussian entries is on average about
$\sqrt{n}$, or in our case $\sqrt{5} \approx 2.2$, due to a fascinating fact known as the circular law (Ginibre, 1965).
The relationship between the eigenvalues (and a related object called singular values) of random
matrices has been shown to have deep connections to proper initialization of neural networks as
was discussed in (Pennington et al., 2017) and subsequent works.
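
The $\sqrt{n}$ scaling can be probed empirically. The sketch below draws a few Gaussian random matrices in plain NumPy and compares the largest eigenvalue magnitude with $\sqrt{n}$. The matrix sizes and the seed are arbitrary choices, and because the statement is about typical behavior of large matrices, the ratios should be expected to hover near one rather than equal it.

```python
import numpy as np

rng = np.random.default_rng(0)

ratios = {}
for n in [100, 200, 400]:
    mat = rng.standard_normal((n, n))  # mean zero, variance one entries
    spectral_radius = np.max(np.abs(np.linalg.eigvals(mat)))
    ratios[n] = spectral_radius / np.sqrt(n)
    print(f"n = {n}: largest |eigenvalue| / sqrt(n) = {ratios[n]:.3f}")
```

Dividing such a matrix by $\sqrt{n}$ is therefore a rough first-principles analogue of the rescaling by the principal eigenvalue performed above.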


Summary 


• Eigenvectors are vectors which are stretched by a matrix without changing direction.

• Eigenvalues are the amount that the eigenvectors are stretched by the application of the
  matrix.

• The eigendecomposition of a matrix can allow for many operations to be reduced to
  operations on the eigenvalues.

• The Gershgorin Circle Theorem can provide approximate values for the eigenvalues of a
  matrix.

• The behavior of iterated matrix powers depends primarily on the size of the largest
  eigenvalue. This understanding has many applications in the theory of neural network
  initialization.


Exercises 
1. What are the eigenvalues and eigenvectors of 


2 1 
A= i F (18.2.21) 


2. What are the eigenvalues and eigenvectors of the following matrix, and what is strange about 
this example compared to the previous one? 
2 1 
A= J . (18.2.22) 


3. Without computing the eigenvalues, is it possible that the smallest eigenvalue of the following
matrix is less than 0.5? Note: this problem can be done in your head.


A= 0.1 1.0 0.1 0.2 
[0.3 0.1 5.0 0.0] ° 
10 0.2 0.0 1.8 


Fe 0.1 0.3 a 
(18.2.23) 


Discussions: https://discuss.d2l.ai/t/411
18.3 Single Variable Calculus


In Section 2.4, we saw the basic elements of differential calculus. This section takes a deeper 
dive into the fundamentals of calculus and how we can understand and apply it in the context of 
machine learning. 


18.3.1 Differential Calculus 


Differential calculus is fundamentally the study of how functions behave under small changes. To 
see why this is so core to deep learning, let us consider an example. 


Suppose that we have a deep neural network where the weights are, for convenience, concatenated
into a single vector $\mathbf{w} = (w_1, \ldots, w_n)$. Given a training dataset, we consider the loss of our neural
network on this dataset, which we will write as $L(\mathbf{w})$.


This function is extraordinarily complex, encoding the performance of all possible models of the
given architecture on this dataset, so it is nearly impossible to tell what set of weights $\mathbf{w}$ will
minimize the loss. Thus, in practice, we often start by initializing our weights randomly, and then
iteratively take small steps in the direction which makes the loss decrease as rapidly as possible.


The question then becomes something that on the surface is no easier: how do we find the direction
which makes the loss decrease as quickly as possible? To dig into this, let us first examine
the case with only a single weight: $L(\mathbf{w}) = L(x)$ for a single real value $x$.


Let us take $x$ and try to understand what happens when we change it by a small amount to $x + \epsilon$.
If you wish to be concrete, think of a number like $\epsilon = 0.0000001$. To help us visualize what happens,
let us graph an example function, $f(x) = \sin(x^x)$, over the interval $[0, 3]$.


%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mxnet import np, npx
npx.set_np()

# Plot a function in a normal range
x_big = np.arange(0.01, 3.01, 0.01)
ys = np.sin(x_big**x_big)
d2l.plot(x_big, ys, 'x', 'f(x)')


[Plot: f(x) against x on [0, 3], with values between -1.0 and 1.0.]
At this large scale, the function's behavior is not simple. However, if we reduce our range to something
smaller like $[1.75, 2.25]$, we see that the graph becomes much simpler.


# Plot the same function in a smaller range
x_med = np.arange(1.75, 2.25, 0.001)
ys = np.sin(x_med**x_med)
d2l.plot(x_med, ys, 'x', 'f(x)')


[Plot: f(x) against x on [1.75, 2.25].]


Taking this to an extreme, if we zoom into a tiny segment, the behavior becomes far simpler: it is 
just a straight line. 


# Plot the same function in a tiny range
x_small = np.arange(2.0, 2.01, 0.0001)
ys = np.sin(x_small**x_small)
d2l.plot(x_small, ys, 'x', 'f(x)')


[Plot: f(x) against x on [2.000, 2.010], with values between -0.80 and -0.76.]


This is the key observation of single variable calculus: the behavior of familiar functions can be 
modeled by a line in a small enough range. This means that for most functions, it is reasonable to 
expect that as we shift the x value of the function by a little bit, the output f(x) will also be shifted 
by a little bit. The only question we need to answer is, “How large is the change in the output 
compared to the change in the input? Is it half as large? Twice as large?” 
Thus, we can consider the ratio of the change in the output of a function for a small change in the
input of the function. We can write this formally as

$$\frac{L(x+\epsilon) - L(x)}{(x+\epsilon) - x} = \frac{L(x+\epsilon) - L(x)}{\epsilon}. \tag{18.3.1}$$








This is already enough to start to play around with in code. For instance, suppose that we know
that $L(x) = x^{2} + 1701(x-4)^3$, then we can see how large this value is at the point $x = 4$ as follows.


# Define our function
def L(x):
    return x**2 + 1701*(x-4)**3

# Print the difference divided by epsilon for several epsilon
for epsilon in [0.1, 0.001, 0.0001, 0.00001]:
    print(f'epsilon = {epsilon:.5f} -> {(L(4+epsilon) - L(4)) / epsilon:.5f}')


epsilon = 0.10000 -> 25.11000 
epsilon = 0.00100 -> 8.00270 
epsilon = 0.00010 -> 8.00012 
epsilon = 0.00001 -> 8.00001 


Now, if we are observant, we will notice that the output is suspiciously close to
$8$. Indeed, if we decrease $\epsilon$, we will see the value become progressively closer to $8$. Thus we may
conclude, correctly, that the value we seek (the degree to which a change in the input changes the output)
should be $8$ at the point $x = 4$. The way that a mathematician encodes this fact is


im Let 9) — L(4) 


e>0 € 


=8. (18.3.2) 





As a bit of a historical digression: in the first few decades of neural network research, scientists
used this algorithm (the method of finite differences) to evaluate how a loss function changed under
small perturbation: just change the weights and see how the loss changed. This is computationally
inefficient, requiring two evaluations of the loss function to see how a single change of one
variable influenced the loss. If we tried to do this with even a paltry few thousand parameters, it
would require several thousand evaluations of the network over the entire dataset! This problem was
not solved until 1986, when the backpropagation algorithm introduced in (Rumelhart et al., 1988)
provided a way to calculate how any change of the weights together would change the loss in the same
computation time as a single prediction of the network over the dataset.


Back in our example, this value 8 is different for different values of x, so it makes sense to define it
as a function of x. More formally, this value-dependent rate of change is referred to as the derivative,
which is written as

$$\frac{df}{dx}(x) = \lim_{\epsilon \rightarrow 0}\frac{f(x+\epsilon) - f(x)}{\epsilon}. \tag{18.3.3}$$





Different texts will use different notations for the derivative. For instance, all of the below
notations indicate the same thing:

$$\frac{df}{dx} = \frac{d}{dx}f = f' = \nabla_x f = D_x f = f_x. \tag{18.3.4}$$








Most authors will pick a single notation and stick with it; however, even that is not guaranteed. It
is best to be familiar with all of these. We will use the notation $\frac{df}{dx}$ throughout this text, unless
we want to take the derivative of a complex expression, in which case we will use $\frac{d}{dx}f$ to write
expressions like

$$\frac{d}{dx}\left[x^4+\cos\left(\frac{x^2+1}{2x-1}\right)\right]. \tag{18.3.5}$$


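Oftentimes, it is intuitively useful to unravel the definition of derivative (18.3.3) again to see how
a function changes when we make a small change of x:

$$\begin{aligned} \frac{df}{dx}(x) = \lim_{\epsilon \rightarrow 0}\frac{f(x+\epsilon) - f(x)}{\epsilon} & \implies \frac{df}{dx}(x) \approx \frac{f(x+\epsilon) - f(x)}{\epsilon} \\ & \implies \epsilon \frac{df}{dx}(x) \approx f(x+\epsilon) - f(x) \\ & \implies f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x). \end{aligned} \tag{18.3.6}$$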











The last equation is worth explicitly calling out. It tells us that if you take any function and change 
the input by a small amount, the output would change by that small amount scaled by the deriva- 
tive. 


In this way, we can understand the derivative as the scaling factor that tells us how large of a change
we get in the output from a change in the input.


18.3.2 Rules of Calculus 


We now turn to the task of understanding how to compute the derivative of an explicit function. A
full formal treatment of calculus would derive everything from first principles. We will not indulge
in this temptation here, but rather provide an understanding of the common rules encountered.


Common Derivatives 


As was seen in Section 2.4, when computing derivatives one can oftentimes use a series of rules 
to reduce the computation to a few core functions. We repeat them here for ease of reference. 


- Derivative of constants. $\frac{d}{dx}c = 0$.

- Derivative of linear functions. $\frac{d}{dx}(ax) = a$.

- Power rule. $\frac{d}{dx}x^n = nx^{n-1}$.

- Derivative of exponentials. $\frac{d}{dx}e^x = e^x$.

- Derivative of the logarithm. $\frac{d}{dx}\log(x) = \frac{1}{x}$.


Derivative Rules 


If every derivative needed to be separately computed and stored in a table, differential calculus
would be near impossible. It is a gift of mathematics that we can generalize the above derivatives
and compute more complex derivatives like finding the derivative of $f(x) = \log\left(1+(x-1)^{10}\right)$. As
was mentioned in Section 2.4, the key to doing so is to codify what happens when we take functions
and combine them in various ways, most importantly: sums, products, and compositions.


- Sum rule. $\frac{d}{dx}\left(g(x) + h(x)\right) = \frac{dg}{dx}(x) + \frac{dh}{dx}(x)$.

- Product rule. $\frac{d}{dx}\left(g(x)\cdot h(x)\right) = g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)$.

- Chain rule. $\frac{d}{dx}g(h(x)) = \frac{dg}{dh}(h(x))\cdot \frac{dh}{dx}(x)$.


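Let us see how we may use (18.3.6) to understand these rules. For the sum rule, consider the following
chain of reasoning:

$$\begin{aligned} f(x+\epsilon) & = g(x+\epsilon) + h(x+\epsilon) \\ & \approx g(x) + \epsilon \frac{dg}{dx}(x) + h(x) + \epsilon \frac{dh}{dx}(x) \\ & = g(x) + h(x) + \epsilon\left(\frac{dg}{dx}(x) + \frac{dh}{dx}(x)\right) \\ & = f(x) + \epsilon\left(\frac{dg}{dx}(x) + \frac{dh}{dx}(x)\right). \end{aligned} \tag{18.3.7}$$

By comparing this result with the fact that $f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x)$, we see that
$\frac{df}{dx}(x) = \frac{dg}{dx}(x) + \frac{dh}{dx}(x)$ as desired. The intuition here is: when we change the input x, g and h jointly
contribute to the change of the output by $\frac{dg}{dx}(x)$ and $\frac{dh}{dx}(x)$.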





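The product is more subtle, and will require a new observation about how to work with these
expressions. We will begin as before using (18.3.6):

$$\begin{aligned} f(x+\epsilon) & = g(x+\epsilon)\cdot h(x+\epsilon) \\ & \approx \left(g(x) + \epsilon \frac{dg}{dx}(x)\right)\cdot\left(h(x) + \epsilon \frac{dh}{dx}(x)\right) \\ & = g(x)\cdot h(x) + \epsilon\left(g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)\right) + \epsilon^2\frac{dg}{dx}(x)\frac{dh}{dx}(x) \\ & = f(x) + \epsilon\left(g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)\right) + \epsilon^2\frac{dg}{dx}(x)\frac{dh}{dx}(x). \end{aligned} \tag{18.3.8}$$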


This resembles the computation done above, and indeed we see our answer
($\frac{df}{dx}(x) = g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)$) sitting next to $\epsilon$, but there is the issue of that term of size $\epsilon^2$. We will refer to this as a
higher-order term, since the power of $\epsilon^2$ is higher than the power of $\epsilon^1$. We will see in a later section
that we will sometimes want to keep track of these; however, for now observe that if $\epsilon = 0.0000001$,
then $\epsilon^2 = 0.00000000000001$ (that is, $10^{-7}$ versus $10^{-14}$), which is vastly smaller. As we send
$\epsilon \rightarrow 0$, we may safely ignore the higher-order terms. As a general convention in this appendix, we
will use “$\approx$” to denote that the two terms are equal up to higher-order terms. However, if we wish
to be more formal we may examine the difference quotient

$$\frac{f(x+\epsilon) - f(x)}{\epsilon} = g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x) + \epsilon \frac{dg}{dx}(x)\frac{dh}{dx}(x), \tag{18.3.9}$$

and see that as we send $\epsilon \rightarrow 0$, the right hand term goes to zero as well.
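

The claim that the leftover term is of size $\epsilon^2$ can be checked numerically. The following is a minimal sketch, not part of the original text: the choices g(x) = sin(x), h(x) = exp(x), and the point x = 1 are our own illustrative assumptions. It prints what remains of f(x + ε) after subtracting the first-order estimate given by the product rule. Note that this remainder also collects the second-order terms of g and h themselves, so we only expect it to shrink in proportion to $\epsilon^2$, not to equal the last term of (18.3.8) exactly.

```python
import math

# Illustrative choices (assumptions, not from the text): g = sin, h = exp
def g(x):
    return math.sin(x)

def dg(x):
    return math.cos(x)

def h(x):
    return math.exp(x)

def dh(x):
    return math.exp(x)

def f(x):
    return g(x) * h(x)

def product_rule(x):
    # g(x) * h'(x) + g'(x) * h(x)
    return g(x) * dh(x) + dg(x) * h(x)

x = 1.0
ratios = []
for epsilon in [1e-1, 1e-2, 1e-3]:
    # What is left of f(x + epsilon) after removing the first-order estimate
    residual = f(x + epsilon) - (f(x) + epsilon * product_rule(x))
    ratios.append(residual / epsilon**2)
    print(f'epsilon = {epsilon:.3f} -> residual = {residual:.3e}, '
          f'residual / epsilon^2 = {ratios[-1]:.4f}')
```

If the ratio in the last column settles to a roughly constant value as ε shrinks, the residual is indeed a higher-order term.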


Finally, with the chain rule, we can again progress as before using (18.3.6) and see that

$$\begin{aligned} f(x+\epsilon) & = g(h(x+\epsilon)) \\ & \approx g\left(h(x) + \epsilon \frac{dh}{dx}(x)\right) \\ & \approx g(h(x)) + \epsilon \frac{dh}{dx}(x) \frac{dg}{dh}(h(x))\\ & = f(x) + \epsilon \frac{dg}{dh}(h(x))\frac{dh}{dx}(x), \end{aligned} \tag{18.3.10}$$

where in the second line we view the function g as having its input (h(x)) shifted by the tiny
quantity $\epsilon \frac{dh}{dx}(x)$.

These rules provide us with a flexible set of tools to compute essentially any expression desired.
For instance,

$$\begin{aligned} \frac{d}{dx}\left[\log\left(1+(x-1)^{10}\right)\right] & = \left(1+(x-1)^{10}\right)^{-1}\frac{d}{dx}\left[1+(x-1)^{10}\right]\\ & = \left(1+(x-1)^{10}\right)^{-1}\left(\frac{d}{dx}[1] + \frac{d}{dx}[(x-1)^{10}]\right) \\ & = \left(1+(x-1)^{10}\right)^{-1}\left(0 + 10(x-1)^9\frac{d}{dx}[x-1]\right) \\ & = 10\left(1+(x-1)^{10}\right)^{-1}(x-1)^9 \\ & = \frac{10(x-1)^9}{1+(x-1)^{10}}. \end{aligned} \tag{18.3.11}$$

Each line has used the following rules:
1. The chain rule and derivative of logarithm.
2. The sum rule.
3. The derivative of constants, chain rule, and power rule.
4. The sum rule, derivative of linear functions, derivative of constants.

Two things should be clear after doing this example:


1. Any function we can write down using sums, products, constants, powers, exponentials, and
logarithms can have its derivative computed mechanically by following these rules.


2. Having a human follow these rules can be tedious and error prone! 


Thankfully, these two facts together hint towards a way forward: this is a perfect candidate for 
mechanization! Indeed backpropagation, which we will revisit later in this section, is exactly that. 
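

Before trusting a hand-derived formula, it is worth comparing it against the difference quotient from (18.3.1). The sketch below does this for the result of (18.3.11). The helper name `finite_difference`, the step size, and the test points are our own illustrative assumptions; it uses only the Python standard library rather than the `np` module used elsewhere in this section.

```python
import math

def f(x):
    return math.log(1 + (x - 1)**10)

def df_dx(x):
    # Closed form obtained in (18.3.11)
    return 10 * (x - 1)**9 / (1 + (x - 1)**10)

def finite_difference(func, x, epsilon=1e-6):
    # Forward difference quotient, as in (18.3.1)
    return (func(x + epsilon) - func(x)) / epsilon

for x in [0.5, 1.5, 2.0]:
    print(f'x = {x}: closed form = {df_dx(x):.6f}, '
          f'finite difference = {finite_difference(f, x):.6f}')
```

The two columns should agree to several decimal places; a large disagreement would point to a mistake in one of the steps of the derivation.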


Linear Approximation 


When working with derivatives, it is often useful to geometrically interpret the approximation
used above. In particular, note that the equation

$$f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x) \tag{18.3.12}$$

approximates the value of f by a line which passes through the point (x, f(x)) and has slope $\frac{df}{dx}(x)$.
In this way we say that the derivative gives a linear approximation to the function f, as illustrated
below:


# Compute sin
xs = np.arange(-np.pi, np.pi, 0.01)
plots = [np.sin(xs)]

# Compute some linear approximations. Use d(sin(x)) / dx = cos(x)
for x0 in [-1.5, 0, 2]:
    plots.append(np.sin(x0) + (xs - x0) * np.cos(x0))

d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])


Higher Order Derivatives


Let us now do something that may on the surface seem strange. Take a function f and compute
the derivative $\frac{df}{dx}$. This gives us the rate of change of f at any point.


However, the derivative, $\frac{df}{dx}$, can be viewed as a function itself, so nothing stops us from
computing the derivative of $\frac{df}{dx}$ to get $\frac{d^2f}{dx^2} = \frac{d}{dx}\left(\frac{df}{dx}\right)$. We will call this the second derivative of f. This
function is the rate of change of the rate of change of f, or in other words, how the rate of change
is changing. We may apply the derivative any number of times to obtain what is called the n-th
derivative. To keep the notation clean, we will denote the n-th derivative as

$$f^{(n)}(x) = \frac{d^{n}f}{dx^{n}} = \left(\frac{d}{dx}\right)^{n} f. \tag{18.3.13}$$

Let us try to understand why this is a useful notion. Below, we visualize $f^{(2)}(x)$, $f^{(1)}(x)$, and $f(x)$.


First, consider the case that the second derivative $f^{(2)}(x)$ is a positive constant. This means that
the slope of the first derivative is positive. As a result, the first derivative $f^{(1)}(x)$ may start out
negative, become zero at a point, and then become positive in the end. This tells us the slope of
our original function f, and therefore the function f itself decreases, flattens out, then increases.
In other words, the function f curves up, and has a single minimum as is shown in Fig. 18.3.1.


[Figure panels: $f^{(2)}(x)$, $f^{(1)}(x)$, $f(x)$]


Fig. 18.3.1: If we assume the second derivative is a positive constant, then the first derivative is
increasing, which implies the function itself has a minimum.


Second, if the second derivative is a negative constant, that means that the first derivative is
decreasing. This implies the first derivative may start out positive, become zero at a point, and then
become negative. Hence, the function f itself increases, flattens out, then decreases. In other
words, the function f curves down, and has a single maximum as is shown in Fig. 18.3.2.


[Figure panels: $f^{(2)}(x)$, $f^{(1)}(x)$, $f(x)$]


Fig. 18.3.2: If we assume the second derivative is a negative constant, then the first derivative is
decreasing, which implies the function itself has a maximum.


Third, if the second derivative is always zero, then the first derivative will never change: it is
constant! This means that f increases (or decreases) at a fixed rate, and f is itself a straight line
as is shown in Fig. 18.3.3.


[Figure panels: $f^{(2)}(x)$, $f^{(1)}(x)$, $f(x)$]


Fig. 18.3.3: If we assume the second derivative is zero, then the first derivative is constant, which
implies the function itself is a straight line.


To summarize, the second derivative can be interpreted as describing the way that the function f
curves. A positive second derivative leads to an upward curve, while a negative second derivative
means that f curves downwards, and a zero second derivative means that f does not curve at all.


Let us take this one step further. Consider the function $g(x) = ax^{2}+ bx + c$. We can then compute
that

$$\begin{aligned} \frac{dg}{dx}(x) & = 2ax + b \\ \frac{d^2g}{dx^2}(x) & = 2a. \end{aligned} \tag{18.3.14}$$

If we have some original function f(x) in mind, we may compute the first two derivatives and
find the values for a, b, and c that make them match this computation. Similarly to the previous
section where we saw that the first derivative gave the best approximation with a straight line,
this construction provides the best approximation by a quadratic. Let us visualize this for f(x) =
sin(x).


# Compute sin
xs = np.arange(-np.pi, np.pi, 0.01)
plots = [np.sin(xs)]

# Compute some quadratic approximations. Use d(sin(x)) / dx = cos(x)
for x0 in [-1.5, 0, 2]:
    plots.append(np.sin(x0) + (xs - x0) * np.cos(x0) -
                 (xs - x0)**2 * np.sin(x0) / 2)

d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])


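[Figure: sin(x) together with its quadratic approximations at x0 = -1.5, 0, and 2, plotted against x with the vertical axis f(x) limited to the range -1.5 to 1.5.]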





We will extend this idea to the idea of a Taylor series in the next section. 


Taylor Series 


The Taylor series provides a method to approximate the function f(x) if we are given values for the
first n derivatives at a point $x_0$, i.e., $\left\{ f(x_0), f^{(1)}(x_0), f^{(2)}(x_0), \ldots, f^{(n)}(x_0) \right\}$. The idea will be to find
a degree n polynomial that matches all the given derivatives at $x_0$.


We saw the case of n = 2 in the previous section and a little algebra shows this is

$$f(x) \approx \frac{1}{2}\frac{d^2f}{dx^2}(x_0)(x-x_0)^{2}+ \frac{df}{dx}(x_0)(x-x_0) + f(x_0). \tag{18.3.15}$$

As we can see above, the denominator of 2 is there to cancel out the 2 we get when we take two
derivatives of $x^2$, while the other terms are all zero. The same logic applies for the first derivative
and the value itself.


If we push the logic further to n = 3, we will conclude that

$$f(x) \approx \frac{\frac{d^3f}{dx^3}(x_0)}{6}(x-x_0)^3 + \frac{\frac{d^2f}{dx^2}(x_0)}{2}(x-x_0)^{2}+ \frac{df}{dx}(x_0)(x-x_0) + f(x_0), \tag{18.3.16}$$

where the 6 = 3 × 2 = 3! comes from the constant we get in front if we take three derivatives of
$x^3$.

Furthermore, we can get a degree n polynomial by

$$P_n(x) = \sum_{i = 0}^{n} \frac{f^{(i)}(x_0)}{i!}(x-x_0)^{i}, \tag{18.3.17}$$

where the notation

$$f^{(n)}(x) = \frac{d^{n}f}{dx^{n}} = \left(\frac{d}{dx}\right)^{n} f. \tag{18.3.18}$$

Indeed, $P_n(x)$ can be viewed as the best n-th degree polynomial approximation to our function
f(x).


While we are not going to dive all the way into the error of the above approximations, it is worth
mentioning the infinite limit. In this case, for well behaved functions (known as real analytic
functions) like cos(x) or $e^x$, we can write out the infinite number of terms and recover
exactly the same function

$$f(x) = \sum_{n = 0}^\infty \frac{f^{(n)}(x_0)}{n!}(x-x_0)^{n}. \tag{18.3.19}$$


Take $f(x) = e^x$ as an example. Since $e^x$ is its own derivative, we know that $f^{(n)}(x) = e^x$.
Therefore, $e^x$ can be reconstructed by taking the Taylor series at $x_0 = 0$, i.e.,

$$e^{x} = \sum_{n = 0}^\infty \frac{x^{n}}{n!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \cdots. \tag{18.3.20}$$

Let us see how this works in code and observe how increasing the degree of the Taylor
approximation brings us closer to the desired function $e^x$.


# Compute the exponential function
xs = np.arange(0, 3, 0.01)
ys = np.exp(xs)

# Compute a few Taylor series approximations
P1 = 1 + xs
P2 = 1 + xs + xs**2 / 2
P5 = 1 + xs + xs**2 / 2 + xs**3 / 6 + xs**4 / 24 + xs**5 / 120

d2l.plot(xs, [ys, P1, P2, P5], 'x', 'f(x)', legend=[
    "Exponential", "Degree 1 Taylor Series", "Degree 2 Taylor Series",
    "Degree 5 Taylor Series"])


[Figure: the exponential function and its Taylor approximations for x from 0 to 3; legend: Exponential, Degree 1 Taylor Series, Degree 2 Taylor Series, Degree 5 Taylor Series.]
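

The hand-written polynomials P1, P2, and P5 are all instances of (18.3.17) with $x_0 = 0$ and $f^{(i)}(0) = 1$. As a small sketch of the general pattern, the function below evaluates the degree-n Taylor polynomial of $e^x$ for any n. The name `taylor_exp` and the evaluation point x = 2 are our own illustrative choices, and the code uses Python's `math` module rather than the `np` module used above.

```python
import math

def taylor_exp(x, degree):
    # P_n(x) = sum_{i=0}^{n} x^i / i!, the Taylor polynomial of e^x at x0 = 0
    return sum(x**i / math.factorial(i) for i in range(degree + 1))

x = 2.0
errors = []
for degree in [1, 2, 5, 10]:
    error = abs(math.exp(x) - taylor_exp(x, degree))
    errors.append(error)
    print(f'degree {degree:2d}: P_n({x}) = {taylor_exp(x, degree):.6f}, '
          f'error = {error:.2e}')
```

The error column should shrink as the degree grows, which is the numerical counterpart of the curves in the plot moving closer to the exponential.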





Taylor series have two primary applications:

1. Theoretical applications: Often when we try to understand a function that is too complex, using
Taylor series enables us to turn it into a polynomial that we can work with directly.

2. Numerical applications: Some functions like $e^x$ or cos(x) are difficult for machines to
compute. They can store tables of values at a fixed precision (and this is often done), but it still
leaves open questions like “What is the 1000-th digit of cos(1)?” Taylor series are often
helpful to answer such questions.
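

To make the numerical application concrete, here is a hedged sketch of how one might evaluate cos(1) to many digits by summing its Taylor series at $x_0 = 0$ with Python's standard `decimal` module. The function name `cos_taylor`, the number of guard digits, and the stopping threshold are our own assumptions, chosen for illustration rather than taken from the text. We ask for 50 digits to keep the example fast; passing a larger `digits` value follows the same pattern.

```python
from decimal import Decimal, getcontext
import math

def cos_taylor(x, digits):
    # Work with a few guard digits beyond the requested precision
    getcontext().prec = digits + 10
    x = Decimal(x)
    term = Decimal(1)    # current term (-1)^k x^(2k) / (2k)!
    total = Decimal(0)
    k = 0
    threshold = Decimal(10) ** -(digits + 5)
    while abs(term) > threshold:
        total += term
        k += 1
        # Ratio of consecutive terms: -x^2 / ((2k - 1) * (2k))
        term = -term * x * x / ((2 * k - 1) * (2 * k))
    return total

approx = cos_taylor(1, 50)
print(approx)          # high-precision value from the Taylor series
print(math.cos(1))     # double-precision reference
```

Because the terms alternate in sign and shrink in magnitude, the truncation error is bounded by the first omitted term, which is why a simple threshold on the term size is a reasonable stopping rule here.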


Summary 


- Derivatives can be used to express how functions change when we change the input by a
small amount.

- Elementary derivatives can be combined using derivative rules to create arbitrarily complex
derivatives.

- Derivatives can be iterated to get second or higher order derivatives. Each increase in order
provides more fine grained information on the behavior of the function.

- Using information in the derivatives of a single data example, we can approximate well
behaved functions by polynomials obtained from the Taylor series.


Exercises 


1. What is the derivative of $x^3-4x+1$?

2. What is the derivative of $\log(\frac{1}{x})$?

3. True or False: If $f'(x) = 0$ then f has a maximum or minimum at x?

4. Where is the minimum of $f(x) = x\log(x)$ for $x \ge 0$ (where we assume that f takes the
limiting value of 0 at f(0))?


Discussions: https://discuss.d2l.ai/t/412


18.4 Multivariable Calculus 


Now that we have a fairly strong understanding of derivatives of a function of a single variable, let 
us return to our original question where we were considering a loss function of potentially billions 
of weights. 





28 https://discuss.d2l.ai/t/412


18.4.1 Higher-Dimensional Differentiation


What Section 18.3 tells us is that if we change a single one of these billions of weights leaving
every other one fixed, we know what will happen! This is nothing more than a function of a single
variable, so we can write

$$L(w_1+\epsilon_1, w_2, \ldots, w_N) \approx L(w_1, w_2, \ldots, w_N) + \epsilon_1 \frac{d}{dw_1} L(w_1, w_2, \ldots, w_N). \tag{18.4.1}$$

We will call the derivative in one variable while fixing the other variables the partial derivative,
and we will use the notation $\frac{\partial}{\partial w_1}$ for the derivative in (18.4.1).


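Now, let us take this and change $w_2$ a little bit to $w_2 + \epsilon_2$:

$$\begin{aligned} L(w_1+\epsilon_1, w_2+\epsilon_2, \ldots, w_N) & \approx L(w_1, w_2+\epsilon_2, \ldots, w_N) + \epsilon_1 \frac{\partial}{\partial w_1} L(w_1, w_2+\epsilon_2, \ldots, w_N) \\ & \approx L(w_1, w_2, \ldots, w_N) \\ & \quad + \epsilon_2\frac{\partial}{\partial w_2} L(w_1, w_2, \ldots, w_N) \\ & \quad + \epsilon_1 \frac{\partial}{\partial w_1} L(w_1, w_2, \ldots, w_N) \\ & \quad + \epsilon_1\epsilon_2\frac{\partial}{\partial w_2}\frac{\partial}{\partial w_1} L(w_1, w_2, \ldots, w_N) \\ & \approx L(w_1, w_2, \ldots, w_N) \\ & \quad + \epsilon_1 \frac{\partial}{\partial w_1} L(w_1, w_2, \ldots, w_N) \\ & \quad + \epsilon_2\frac{\partial}{\partial w_2} L(w_1, w_2, \ldots, w_N). \end{aligned} \tag{18.4.2}$$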


We have again used the idea that $\epsilon_1\epsilon_2$ is a higher order term that we can discard in the same way
we could discard $\epsilon^2$ in the previous section, along with what we saw in (18.4.1). By continuing in
this manner, we may write that

$$L(w_1+\epsilon_1, w_2+\epsilon_2, \ldots, w_N+\epsilon_N) \approx L(w_1, w_2, \ldots, w_N) + \sum_i \epsilon_i \frac{\partial}{\partial w_i} L(w_1, w_2, \ldots, w_N). \tag{18.4.3}$$


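This may look like a mess, but we can make this more familiar by noting that the sum on the right
looks exactly like a dot product, so if we let

$$\boldsymbol{\epsilon} = [\epsilon_1, \ldots, \epsilon_N]^\top \; \text{and} \; \nabla_{\mathbf{w}} L = \left[\frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_N}\right]^\top, \tag{18.4.4}$$

then

$$L(\mathbf{w} + \boldsymbol{\epsilon}) \approx L(\mathbf{w}) + \boldsymbol{\epsilon}\cdot \nabla_{\mathbf{w}} L(\mathbf{w}). \tag{18.4.5}$$

We will call the vector $\nabla_{\mathbf{w}} L$ the gradient of L.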


Equation (18.4.5) is worth pondering for a moment. It has exactly the format that we encountered
in one dimension, except that we have converted everything to vectors and dot products. It allows us to
tell approximately how the function L will change given any perturbation to the input. As we will
see in the next section, this will provide us with an important tool in understanding geometrically
how we can learn using information contained in the gradient.


But first, let us see this approximation at work with an example. Suppose that we are working with
the function

$$f(x, y) = \log(e^x + e^y) \text{ with gradient } \nabla f (x, y) = \left[\frac{e^x}{e^x+e^y}, \frac{e^y}{e^x+e^y}\right]. \tag{18.4.6}$$

If we look at a point like (0, log(2)), we see that

$$f(x, y) = \log(3) \text{ with gradient } \nabla f (x, y) = \left[\frac{1}{3}, \frac{2}{3}\right]. \tag{18.4.7}$$

Thus, if we want to approximate f at $(\epsilon_1, \log(2) + \epsilon_2)$, we see that we should have the specific
instance of (18.4.5):

$$f(\epsilon_1, \log(2) + \epsilon_2) \approx \log(3) + \frac{1}{3}\epsilon_1 + \frac{2}{3}\epsilon_2. \tag{18.4.8}$$

We can test this in code to see how good the approximation is. 


%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mpl_toolkits import mplot3d
from mxnet import autograd, np, npx
npx.set_np()

def f(x, y):
    return np.log(np.exp(x) + np.exp(y))
def grad_f(x, y):
    return np.array([np.exp(x) / (np.exp(x) + np.exp(y)),
                     np.exp(y) / (np.exp(x) + np.exp(y))])

epsilon = np.array([0.01, -0.03])
grad_approx = f(0, np.log(2)) + epsilon.dot(grad_f(0, np.log(2)))
true_value = f(0 + epsilon[0], np.log(2) + epsilon[1])
f'approximation: {grad_approx}, true Value: {true_value}'


'approximation: 1.0819456577301025, true Value: 1.0821242332458496'


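18.4.2 Geometry of Gradients and Gradient Descent


Consider again (18.4.5):

$$L(\mathbf{w} + \boldsymbol{\epsilon}) \approx L(\mathbf{w}) + \boldsymbol{\epsilon}\cdot \nabla_{\mathbf{w}} L(\mathbf{w}). \tag{18.4.9}$$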


Let us suppose that we want to use this to help minimize our loss L. Let us understand geometrically
the algorithm of gradient descent first described in Section 2.5. What we will do is the following:

1. Start with a random choice for the initial parameters $\mathbf{w}$.

2. Find the direction $\mathbf{v}$ that makes L decrease the most rapidly at $\mathbf{w}$.

3. Take a small step in that direction: $\mathbf{w} \rightarrow \mathbf{w} + \epsilon\mathbf{v}$.

4. Repeat.


The only thing we do not know exactly how to do is to compute the vector $\mathbf{v}$ in the second step.
We will call such a direction the direction of steepest descent. Using the geometric understanding of
dot products from Section 18.1, we see that we can rewrite (18.4.5) as

$$L(\mathbf{w} + \mathbf{v}) \approx L(\mathbf{w}) + \mathbf{v}\cdot \nabla_{\mathbf{w}} L(\mathbf{w}) = L(\mathbf{w}) + \|\nabla_{\mathbf{w}} L(\mathbf{w})\|\cos(\theta). \tag{18.4.10}$$


Note that we have taken our direction to have length one for convenience, and used $\theta$ for the angle
between $\mathbf{v}$ and $\nabla_{\mathbf{w}} L(\mathbf{w})$. If we want to find the direction that decreases L as rapidly as possible, we
want to make this expression as negative as possible. The only way the direction we pick enters
into this equation is through $\cos(\theta)$, and thus we wish to make this cosine as negative as possible.
Now, recalling the shape of cosine, we can make this as negative as possible by making $\cos(\theta) = -1$
or equivalently making the angle between the gradient and our chosen direction to be $\pi$ radians,
or equivalently 180 degrees. The only way to achieve this is to head in the exact opposite direction:
pick $\mathbf{v}$ to point in the exact opposite direction to $\nabla_{\mathbf{w}} L(\mathbf{w})$!
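

The argument above can be checked by brute force in two dimensions. In the sketch below, the gradient vector `grad = [3.0, -4.0]` is a made-up example (an assumption for illustration, not a value from the text). We scan unit vectors $\mathbf{v} = (\cos t, \sin t)$ on a fine grid of angles, evaluate the first-order change $\mathbf{v}\cdot\nabla_{\mathbf{w}} L$ from (18.4.10) for each, and compare the best direction found with the normalized negative gradient.

```python
import math

# Hypothetical gradient of L at some point w (illustrative values)
grad = [3.0, -4.0]
grad_norm = math.hypot(*grad)

def directional_change(t):
    # First-order change v . grad for the unit vector v = (cos t, sin t)
    return math.cos(t) * grad[0] + math.sin(t) * grad[1]

# Scan unit directions in steps of 0.1 degrees and keep the steepest decrease
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best_angle = min(angles, key=directional_change)
best_v = (math.cos(best_angle), math.sin(best_angle))
neg_unit_grad = (-grad[0] / grad_norm, -grad[1] / grad_norm)

print('best sampled direction:', best_v)
print('-grad / ||grad||:      ', neg_unit_grad)
print('smallest change:', directional_change(best_angle), 'vs', -grad_norm)
```

Up to the resolution of the angle grid, the best sampled direction should coincide with the negative gradient direction, and the smallest change should match $-\|\nabla_{\mathbf{w}} L\|$.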


This brings us to one of the most important mathematical concepts in machine learning: the
direction of steepest descent points in the direction of $-\nabla_{\mathbf{w}}L(\mathbf{w})$. Thus our informal algorithm
can be rewritten as follows.


1. Start with a random choice for the initial parameters $\mathbf{w}$.

2. Compute $\nabla_{\mathbf{w}} L(\mathbf{w})$.

3. Take a small step in the opposite of that direction: $\mathbf{w} \rightarrow \mathbf{w} - \epsilon\nabla_{\mathbf{w}} L(\mathbf{w})$.

4. Repeat.


This basic algorithm has been modified and adapted many ways by many researchers, but the core 
concept remains the same in all of them. Use the gradient to find the direction that decreases the 
loss as rapidly as possible, and update the parameters to take a step in that direction. 
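

The four steps above can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the bowl-shaped loss with its minimum at (1, -2), the fixed (rather than random) starting point, the step size of 0.1, and the 100 iterations. Practical training loops differ in many details, but the update line is the one from step 3.

```python
def loss(w):
    # Illustrative loss (not from the text): a bowl with minimum at (1, -2)
    return (w[0] - 1)**2 + 2 * (w[1] + 2)**2

def grad_loss(w):
    # Gradient of the loss above, computed by hand
    return [2 * (w[0] - 1), 4 * (w[1] + 2)]

w = [5.0, 5.0]    # initial parameters (fixed here for reproducibility)
epsilon = 0.1     # step size
for step in range(100):
    g = grad_loss(w)
    # Step against the gradient: w -> w - epsilon * grad L(w)
    w = [wi - epsilon * gi for wi, gi in zip(w, g)]

print(f'w = {w}, loss = {loss(w):.3e}')
```

With a step size this small the parameters settle near the minimum; a step size that is too large can instead make the iterates overshoot and diverge.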


18.4.3 A Note on Mathematical Optimization 


Throughout this book, we focus squarely on numerical optimization techniques for the practical 
reason that all functions we encounter in the deep learning setting are too complex to minimize 
explicitly. 

However, it is a useful exercise to consider what the geometric understanding we obtained above
tells us about optimizing functions directly.


Suppose that we wish to find the value of $\mathbf{x}_0$ which minimizes some function $L(\mathbf{x})$. Let us suppose
that moreover someone gives us a value and tells us that it is the value that minimizes L. Is there
anything we can check to see if their answer is even plausible?


Again consider (18.4.5):

$$L(\mathbf{x}_0 + \boldsymbol{\epsilon}) \approx L(\mathbf{x}_0) + \boldsymbol{\epsilon}\cdot \nabla_{\mathbf{x}} L(\mathbf{x}_0). \tag{18.4.11}$$

If the gradient is not zero, we know that we can take a step in the direction $-\epsilon \nabla_{\mathbf{x}} L(\mathbf{x}_0)$ to find a
value of L that is smaller. Thus, if we truly are at a minimum, this cannot be the case! We can
conclude that if $\mathbf{x}_0$ is a minimum, then $\nabla_{\mathbf{x}} L(\mathbf{x}_0) = 0$. We call points with $\nabla_{\mathbf{x}} L(\mathbf{x}_0) = 0$ critical
points.


This is nice, because in some rare settings, we can explicitly find all the points where the gradient
is zero, and find the one with the smallest value.


For a concrete example, consider the function

$$f(x) = 3x^4 - 4x^3 -12x^2. \tag{18.4.12}$$

This function has derivative

$$\frac{df}{dx} = 12x^3 - 12x^2 -24x = 12x(x-2)(x+1). \tag{18.4.13}$$

The only possible locations of minima are at x = -1, 0, 2, where the function takes the values
-5, 0, -32 respectively, and thus we can conclude that we minimize our function when x = 2. A
quick plot confirms this.


x = np.arange(-2, 3, 0.01)
f = (3 * x**4) - (4 * x**3) - (12 * x**2)

d2l.plot(x, f, 'x', 'f(x)')


[Figure: plot of f(x) against x for x from -2 to 3.]


This highlights an important fact to know when working either theoretically or numerically: the 
only possible points where we can minimize (or maximize) a function will have gradient equal to 
zero, however, not every point with gradient zero is the true global minimum (or maximum). 
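

The worked example can also be checked without a plot. The sketch below, written in plain Python as an illustration, evaluates the derivative from (18.4.13) at the three candidate points, uses the sign of the second derivative (computed by hand here as $36x^2 - 24x - 24$) to classify each one, and then picks the candidate with the smallest function value. Comparing only the critical points is enough in this case because f grows without bound as x moves far from the origin in either direction.

```python
def f(x):
    return 3 * x**4 - 4 * x**3 - 12 * x**2

def df_dx(x):
    return 12 * x**3 - 12 * x**2 - 24 * x

def d2f_dx2(x):
    # Second derivative, obtained by differentiating df_dx by hand
    return 36 * x**2 - 24 * x - 24

critical_points = [-1, 0, 2]
kinds = {}
for x in critical_points:
    assert df_dx(x) == 0  # every candidate has zero derivative
    curvature = d2f_dx2(x)
    kinds[x] = 'local minimum' if curvature > 0 else 'local maximum'
    print(f"x = {x:2d}: f(x) = {f(x):4d}, f''(x) = {curvature:3d} -> {kinds[x]}")

best = min(critical_points, key=f)
print(f'global minimum at x = {best} with f(x) = {f(best)}')
```

This makes the closing remark concrete: all three points have zero derivative, yet only one of them is the global minimum, one is merely a local minimum, and one is a local maximum.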


18.4.4 Multivariate Chain Rule 


Let us suppose that we have a function of four variables (w, x, y, and z) which we can make by
composing many terms:

$$\begin{aligned}f(u, v) & = (u+v)^{2} \\u(a, b) & = (a+b)^{2}, \qquad v(a, b) = (a-b)^{2}, \\a(w, x, y, z) & = (w+x+y+z)^{2}, \qquad b(w, x, y, z) = (w+x-y-z)^2.\end{aligned} \tag{18.4.14}$$





Such chains of equations are common when working with neural networks, so trying to under- 
stand how to compute gradients of such functions is key. We can start to see visual hints of this 
connection in Fig. 18.4.1 if we take a look at what variables directly relate to one another. 










Fig. 18.4.1: The function relations above where nodes represent values and edges show functional 
dependence. 


Nothing stops us from just composing everything from (18.4.14) and writing out that

$$f(w, x, y, z) = \left(\left((w+x+y+z)^2+(w+x-y-z)^2\right)^2+\left((w+x+y+z)^2-(w+x-y-z)^2\right)^2\right)^2. \tag{18.4.15}$$





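We may then take the derivative by just using single variable derivatives, but if we did that we
would quickly find ourselves swamped with terms, many of which are repeats! Indeed, one can see
that, for instance:

$$\begin{aligned} \frac{\partial f}{\partial w} & = 2 \Big(2 \big(2 (w + x + y + z) - 2 (w + x - y - z)\big) \big((w + x + y + z)^{2}- (w + x - y - z)^{2}\big) + \\ & \quad 2 \big(2 (w + x - y - z) + 2 (w + x + y + z)\big) \big((w + x - y - z)^{2}+ (w + x + y + z)^{2}\big)\Big) \times \\ & \quad \Big(\big((w + x + y + z)^{2}- (w + x - y - z)^2\big)^{2}+ \big((w + x - y - z)^{2}+ (w + x + y + z)^{2}\big)^{2}\Big). \end{aligned} \tag{18.4.16}$$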





If we then also wanted to compute $\frac{\partial f}{\partial x}$, we would end up with a similar equation again with many
repeated terms, and many shared repeated terms between the two derivatives. This represents
a massive quantity of wasted work, and if we needed to compute derivatives this way, the whole
deep learning revolution would have stalled out before it began!


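Let us break up the problem. We will start by trying to understand how f changes when we change
a, essentially assuming that w, x, y, and z all do not exist. We will reason as we did back when we
worked with the gradient for the first time. Let us take a and add a small amount $\epsilon$ to it.

$$\begin{aligned} & f(u(a+\epsilon, b), v(a+\epsilon, b)) \\ \approx & f\left(u(a, b) + \epsilon\frac{\partial u}{\partial a}(a, b), v(a, b) + \epsilon\frac{\partial v}{\partial a}(a, b)\right) \\ \approx & f(u(a, b), v(a, b)) + \epsilon\left[\frac{\partial f}{\partial u}(u(a, b), v(a, b))\frac{\partial u}{\partial a}(a, b) + \frac{\partial f}{\partial v}(u(a, b), v(a, b))\frac{\partial v}{\partial a}(a, b)\right]. \end{aligned} \tag{18.4.17}$$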


The first line follows from the definition of partial derivative, and the second follows from the
definition of gradient. It is notationally burdensome to track exactly where we evaluate every
derivative, as in the expression $\frac{\partial f}{\partial u}(u(a, b), v(a, b))$, so we often abbreviate this to the much more
memorable

$$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial a}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial a}. \tag{18.4.18}$$

It is useful to think about the meaning of the process. We are trying to understand how a function
of the form f(u(a, b), v(a, b)) changes its value with a change in a. There are two pathways this can
occur: there is the pathway where $a \rightarrow u \rightarrow f$ and where $a \rightarrow v \rightarrow f$. We can compute both of
these contributions via the chain rule: $\frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial a}$ and $\frac{\partial f}{\partial v} \cdot \frac{\partial v}{\partial a}$ respectively, and add them up.
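

As a quick sanity check of (18.4.18), the sketch below sums the two path contributions for the functions u, v, and f from (18.4.14) and compares the result with a finite difference in a. The evaluation point (a, b) = (1.5, -0.5) and the step size are arbitrary choices of ours, made only for illustration.

```python
def u(a, b):
    return (a + b)**2

def v(a, b):
    return (a - b)**2

def f(u_val, v_val):
    return (u_val + v_val)**2

a, b = 1.5, -0.5    # illustrative evaluation point

# Single step partials, evaluated at (a, b)
df_du = 2 * (u(a, b) + v(a, b))
df_dv = 2 * (u(a, b) + v(a, b))
du_da = 2 * (a + b)
dv_da = 2 * (a - b)

# Sum the contributions of the two paths a -> u -> f and a -> v -> f
chain_rule = df_du * du_da + df_dv * dv_da

# Compare with a finite difference in a, keeping b fixed
epsilon = 1e-6
finite_diff = (f(u(a + epsilon, b), v(a + epsilon, b))
               - f(u(a, b), v(a, b))) / epsilon
print(f'chain rule: {chain_rule:.4f}, finite difference: {finite_diff:.4f}')
```

Dropping either path from the sum would leave a clear gap between the two numbers, which is a useful way to convince oneself that both pathways matter.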





Imagine we have a different network of functions where the functions on the right depend on
those that they are connected to on the left, as is shown in Fig. 18.4.2.


Fig. 18.4.2: Another more subtle example of the chain rule.


To compute something like $\frac{\partial f}{\partial y}$, we need to sum over all (in this case 3) paths from y to f, giving

$$\frac{\partial f}{\partial y} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial u} \frac{\partial u}{\partial y} + \frac{\partial f}{\partial u} \frac{\partial u}{\partial y} + \frac{\partial f}{\partial b} \frac{\partial b}{\partial v} \frac{\partial v}{\partial y}. \tag{18.4.19}$$

Understanding the chain rule in this way will pay great dividends when trying to understand how
gradients flow through networks, and why various architectural choices like those in LSTMs
(Section 9.2) or residual layers (Section 7.6) can help shape the learning process by controlling gradient
flow.


18.4.5 The Backpropagation Algorithm 


Let us return to the example of (18.4.14) from the previous section where

$$\begin{aligned}f(u, v) & = (u+v)^{2} \\u(a, b) & = (a+b)^{2}, \qquad v(a, b) = (a-b)^{2}, \\a(w, x, y, z) & = (w+x+y+z)^{2}, \qquad b(w, x, y, z) = (w+x-y-z)^2.\end{aligned} \tag{18.4.20}$$





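If we want to compute, say, $\frac{\partial f}{\partial w}$, we may apply the multi-variate chain rule to see:

$$\begin{aligned} \frac{\partial f}{\partial w} & = \frac{\partial f}{\partial u}\frac{\partial u}{\partial w} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial w}, \\ \frac{\partial u}{\partial w} & = \frac{\partial u}{\partial a}\frac{\partial a}{\partial w}+\frac{\partial u}{\partial b}\frac{\partial b}{\partial w}, \\ \frac{\partial v}{\partial w} & = \frac{\partial v}{\partial a}\frac{\partial a}{\partial w}+\frac{\partial v}{\partial b}\frac{\partial b}{\partial w}. \end{aligned} \tag{18.4.21}$$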





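Let us try using this decomposition to compute $\frac{\partial f}{\partial w}$. Notice that all we need here are the various
single step partials:

$$\begin{aligned} \frac{\partial f}{\partial u} = 2(u+v), & \quad\frac{\partial f}{\partial v} = 2(u+v), \\ \frac{\partial u}{\partial a} = 2(a+b), & \quad\frac{\partial u}{\partial b} = 2(a+b), \\ \frac{\partial v}{\partial a} = 2(a-b), & \quad\frac{\partial v}{\partial b} = -2(a-b), \\ \frac{\partial a}{\partial w} = 2(w+x+y+z), & \quad\frac{\partial b}{\partial w} = 2(w+x-y-z). \end{aligned} \tag{18.4.22}$$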


If we write this out into code this becomes a fairly manageable expression.


# Compute the value of the function from inputs to outputs
w, x, y, z = -1, 0, -2, 1
a, b = (w + x + y + z)**2, (w + x - y - z)**2
u, v = (a + b)**2, (a - b)**2
f = (u + v)**2
print(f'f at {w}, {x}, {y}, {z} is {f}')

# Compute the single step partials
df_du, df_dv = 2*(u + v), 2*(u + v)
du_da, du_db, dv_da, dv_db = 2*(a + b), 2*(a + b), 2*(a - b), -2*(a - b)
da_dw, db_dw = 2*(w + x + y + z), 2*(w + x - y - z)

# Compute the final result from inputs to outputs
du_dw, dv_dw = du_da*da_dw + du_db*db_dw, dv_da*da_dw + dv_db*db_dw
df_dw = df_du*du_dw + df_dv*dv_dw
print(f'df/dw at {w}, {x}, {y}, {z} is {df_dw}')








f at -1, 0, -2, 1 is 1024 
df/dw at -1, 0, -2, 1 is -4096 


However, note that this still does not make it easy to compute something like $\frac{\partial f}{\partial x}$. The reason for
that is the way we chose to apply the chain rule. If we look at what we did above, we always kept
$\partial w$ in the denominator when we could. In this way, we chose to apply the chain rule seeing how
w changed every other variable. If that is what we wanted, this would be a good idea. However,
think back to our motivation from deep learning: we want to see how every parameter changes
the loss. In essence, we want to apply the chain rule keeping $\partial f$ in the numerator whenever we
can!


To be more explicit, note that we can write 


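$$\begin{aligned} \frac{\partial f}{\partial w} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial w} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial w}, \\ \frac{\partial f}{\partial a} & = \frac{\partial f}{\partial u}\frac{\partial u}{\partial a}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial a}, \\ \frac{\partial f}{\partial b} & = \frac{\partial f}{\partial u}\frac{\partial u}{\partial b}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial b}. \end{aligned} \tag{18.4.23}$$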





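Note that this application of the chain rule has us explicitly compute $\frac{\partial f}{\partial u}$, $\frac{\partial f}{\partial v}$, $\frac{\partial f}{\partial a}$, $\frac{\partial f}{\partial b}$, and $\frac{\partial f}{\partial w}$.
Nothing stops us from also including the equations:

$$\begin{aligned} \frac{\partial f}{\partial x} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial x} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial x}, \\ \frac{\partial f}{\partial y} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial y}+\frac{\partial f}{\partial b}\frac{\partial b}{\partial y}, \\ \frac{\partial f}{\partial z} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial z}+\frac{\partial f}{\partial b}\frac{\partial b}{\partial z}, \end{aligned} \tag{18.4.24}$$

and then keeping track of how f changes when we change any node in the entire network. Let us
implement it.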


# Compute the value of the function from inputs to outputs
w, x, y, z = -1, 0, -2, 1
a, b = (w + x + y + z)**2, (w + x - y - z)**2
u, v = (a + b)**2, (a - b)**2
f = (u + v)**2
print(f'f at {w}, {x}, {y}, {z} is {f}')

# Compute the derivative using the decomposition above
# First compute the single step partials
df_du, df_dv = 2*(u + v), 2*(u + v)
du_da, du_db, dv_da, dv_db = 2*(a + b), 2*(a + b), 2*(a - b), -2*(a - b)
da_dw, db_dw = 2*(w + x + y + z), 2*(w + x - y - z)
da_dx, db_dx = 2*(w + x + y + z), 2*(w + x - y - z)
da_dy, db_dy = 2*(w + x + y + z), -2*(w + x - y - z)
da_dz, db_dz = 2*(w + x + y + z), -2*(w + x - y - z)

# Now compute how f changes when we change any value from output to input
df_da, df_db = df_du*du_da + df_dv*dv_da, df_du*du_db + df_dv*dv_db
df_dw, df_dx = df_da*da_dw + df_db*db_dw, df_da*da_dx + df_db*db_dx
df_dy, df_dz = df_da*da_dy + df_db*db_dy, df_da*da_dz + df_db*db_dz

print(f'df/dw at {w}, {x}, {y}, {z} is {df_dw}')
print(f'df/dx at {w}, {x}, {y}, {z} is {df_dx}')
print(f'df/dy at {w}, {x}, {y}, {z} is {df_dy}')
print(f'df/dz at {w}, {x}, {y}, {z} is {df_dz}')


f at -1, 0, -2, 1 is 1024


df/dw at -1, 0, -2, 1 is -4096 
df/dx at -1, 0, -2, 1 is -4096 
df/dy at -1, 0, -2, 1 is -4096 
df/dz at -1, 0, -2, 1 is -4096 


The fact that we compute derivatives from f back towards the inputs rather than from the inputs
forward to the outputs (as we did in the first code snippet above) is what gives this algorithm its
name: backpropagation. Note that there are two steps:

1. Compute the value of the function, and the single step partials from front to back. While not
done above, this can be combined into a single forward pass.

2. Compute the gradient of f from back to front. We call this the backward pass.


This is precisely what every deep learning algorithm implements to allow the computation of the
gradient of the loss with respect to every weight in the network in one pass. It is an astonishing
fact that we have such a decomposition.


To see how to encapsulate this, let us take a quick look at this example.


# Initialize as ndarrays, then attach gradients
w, x, y, z = np.array(-1), np.array(0), np.array(-2), np.array(1)

w.attach_grad()
x.attach_grad()
y.attach_grad()
z.attach_grad()

# Do the computation like usual, tracking gradients
with autograd.record():
    a, b = (w + x + y + z)**2, (w + x - y - z)**2
    u, v = (a + b)**2, (a - b)**2
    f = (u + v)**2

# Execute backward pass
f.backward()

print(f'df/dw at {w}, {x}, {y}, {z} is {w.grad}')
print(f'df/dx at {w}, {x}, {y}, {z} is {x.grad}')
print(f'df/dy at {w}, {x}, {y}, {z} is {y.grad}')
print(f'df/dz at {w}, {x}, {y}, {z} is {z.grad}')


df/dw at -1.0, 0.0, -2.0, 1.0 is -4096.0 
df/dx at -1.0, 0.0, -2.0, 1.0 is -4096.0 
df/dy at -1.0, 0.0, -2.0, 1.0 is -4096.0 
df/dz at -1.0, 0.0, -2.0, 1.0 is -4096.0 


All of what we did above can be done automatically by calling f.backward().
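As a sanity check on the value -4096, one can compare against central finite differences. The sketch below is not part of the original text: it uses plain Python rather than MXNet, and the helper name `numerical_gradient` and the step size `h` are illustrative assumptions.

```python
# Illustrative check (assumed helper names): estimate each partial
# derivative of f by central finite differences and compare with the
# backpropagated value of -4096.

def f(w, x, y, z):
    a, b = (w + x + y + z)**2, (w + x - y - z)**2
    u, v = (a + b)**2, (a - b)**2
    return (u + v)**2

def numerical_gradient(func, point, h=1e-5):
    """Central-difference estimate of every partial derivative at `point`."""
    grads = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grads.append((func(*up) - func(*down)) / (2 * h))
    return grads

grads = numerical_gradient(f, [-1.0, 0.0, -2.0, 1.0])
print(grads)  # every entry should be close to -4096
```

Finite differences need two function evaluations per input, whereas the backward pass yields all four partial derivatives at once, which is why backpropagation scales to networks with many weights.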


18.4.6 Hessians 
As with single variable calculus, it is useful to consider higher-order derivatives in order to get a 
handle on how we can obtain a better approximation to a function than using the gradient alone. 


There is one immediate problem one encounters when working with higher order derivatives of 
functions of several variables, and that is there are a large number of them. If we have a function 





$f(x_1, \ldots, x_n)$ of $n$ variables, then we can take $n^{2}$ many second derivatives, namely for any choice
of $i$ and $j$:

$$\frac{d^2f}{dx_idx_j} = \frac{d}{dx_i}\left(\frac{d}{dx_j}f\right). \tag{18.4.25}$$


This is traditionally assembled into a matrix called the Hessian:

$$\mathbf{H}_f = \begin{bmatrix} \frac{d^2f}{dx_1dx_1} & \cdots & \frac{d^2f}{dx_1dx_n} \\ \vdots & \ddots & \vdots \\ \frac{d^2f}{dx_ndx_1} & \cdots & \frac{d^2f}{dx_ndx_n} \end{bmatrix}. \tag{18.4.26}$$

Not every entry of this matrix is independent. Indeed, we can show that as long as both mixed
partials (partial derivatives with respect to more than one variable) exist and are continuous, we
can say that for any $i$, and $j$,

$$\frac{d^2f}{dx_idx_j} = \frac{d^2f}{dx_jdx_i}. \tag{18.4.27}$$


This follows by considering first perturbing a function in the direction of $x_i$, and then perturbing
it in $x_j$ and then comparing the result of that with what happens if we perturb first $x_j$ and then $x_i$,
with the knowledge that both of these orders lead to the same final change in the output of f.
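The symmetry of mixed partials can be observed numerically. The following is a minimal sketch in plain Python; the test function and the helper names `d_dx` and `d_dy` are assumptions chosen for illustration, not part of the original text.

```python
import math

def f(x, y):
    # An arbitrary smooth test function (an assumption for this sketch)
    return math.sin(x * y) + x**3 * y

def d_dx(g, h=1e-4):
    """Return a function approximating the partial derivative of g in x."""
    return lambda x, y: (g(x + h, y) - g(x - h, y)) / (2 * h)

def d_dy(g, h=1e-4):
    """Return a function approximating the partial derivative of g in y."""
    return lambda x, y: (g(x, y + h) - g(x, y - h)) / (2 * h)

f_xy = d_dy(d_dx(f))  # perturb in x first, then in y
f_yx = d_dx(d_dy(f))  # perturb in y first, then in x

print(f_xy(0.7, -1.2), f_yx(0.7, -1.2))  # the two orders should agree
```

Both orders approximate the same off-diagonal Hessian entry, which is why the Hessian of such a function is a symmetric matrix.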


As with single variables, we can use these derivatives to get a far better idea of how the function
behaves near a point. In particular, we can use it to find the best fitting quadratic near a point $\mathbf{x}_0$,
as we saw in a single variable.

Let us see an example. Suppose that $f(x_1, x_2) = a + b_1x_1 + b_2x_2 + c_{11}x_1^{2} + c_{12}x_1x_2 + c_{22}x_2^{2}$. This is the
general form for a quadratic in two variables. If we look at the value of the function, its gradient,
and its Hessian (18.4.26), all at the point zero:

$$\begin{aligned} f(0,0) & = a, \\ \nabla f(0,0) & = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}, \\ \mathbf{H} f(0,0) & = \begin{bmatrix} 2c_{11} & c_{12} \\ c_{12} & 2c_{22} \end{bmatrix}, \end{aligned} \tag{18.4.28}$$

we can get our original polynomial back by saying

$$f(\mathbf{x}) = f(0) + \nabla f(0) \cdot \mathbf{x} + \frac{1}{2}\mathbf{x}^\top \mathbf{H} f(0) \mathbf{x}. \tag{18.4.29}$$

In general, if we computed this expansion at any point $\mathbf{x}_0$, we see that

$$f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0) \cdot (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2}(\mathbf{x} - \mathbf{x}_0)^\top \mathbf{H} f(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0). \tag{18.4.30}$$


This works for any dimensional input, and provides the best approximating quadratic to any function
at a point. To give an example, let us plot the function

$$f(x, y) = xe^{-x^2-y^2}. \tag{18.4.31}$$

One can compute that the gradient and Hessian are

$$\nabla f(x, y) = e^{-x^2-y^2}\begin{pmatrix} 1-2x^2 \\ -2xy \end{pmatrix} \; \text{and} \; \mathbf{H}f(x, y) = e^{-x^2-y^2}\begin{pmatrix} 4x^3 - 6x & 4x^2y - 2y \\ 4x^2y - 2y & 4xy^2 - 2x \end{pmatrix}. \tag{18.4.32}$$

And thus, with a little algebra, see that the approximating quadratic at $[-1, 0]^\top$ is

$$f(x, y) \approx e^{-1}\left(-1 - (x+1) + (x+1)^2 + y^2\right). \tag{18.4.33}$$


# Construct grid and compute function
x, y = np.meshgrid(np.linspace(-2, 2, 101),
                   np.linspace(-2, 2, 101), indexing='ij')
z = x*np.exp(- x**2 - y**2)

# Compute approximating quadratic with gradient and Hessian at (-1, 0)
w = np.exp(-1)*(-1 - (x + 1) + (x + 1)**2 + y**2)

# Plot function
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.plot_wireframe(x, y, w, **{'rstride': 10, 'cstride': 10}, color='purple')
d2l.plt.xlabel('x')
d2l.plt.ylabel('y')
d2l.set_figsize()
ax.set_xlim(-2, 2)
ax.set_ylim(-2, 2)
ax.set_zlim(-1, 1)
ax.dist = 12

This forms the basis for Newton's Algorithm discussed in Section 11.3, where we perform numerical
optimization iteratively finding the best fitting quadratic, and then exactly minimizing that
quadratic.
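To see how well the quadratic from (18.4.33) tracks the function, we can measure the gap at points approaching (-1, 0). This is a small sketch in plain Python; the step sizes and the variable names are illustrative assumptions.

```python
import math

def f(x, y):
    return x * math.exp(-x**2 - y**2)

def quadratic(x, y):
    # Approximating quadratic at (-1, 0), taken from (18.4.33)
    return math.exp(-1) * (-1 - (x + 1) + (x + 1)**2 + y**2)

errors = []
for step in [0.5, 0.1, 0.01]:
    x, y = -1 + step, step           # move away from (-1, 0) by `step`
    errors.append(abs(f(x, y) - quadratic(x, y)))
    print(f'step {step}: error {errors[-1]:.2e}')
```

Since the quadratic matches the value, gradient, and Hessian at the expansion point, the error should shrink roughly like the cube of the step.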


18.4.7 A Little Matrix Calculus 


Derivatives of functions involving matrices turn out to be particularly nice. This section can be- 
come notationally heavy, so may be skipped in a first reading, but it is useful to know how deriva- 
tives of functions involving common matrix operations are often much cleaner than one might 
initially anticipate, particularly given how central matrix operations are to deep learning applica- 
tions. 


Let us begin with an example. Suppose that we have some fixed column vector $\boldsymbol{\beta}$, and we want
to take the product function $f(\mathbf{x}) = \boldsymbol{\beta}^\top\mathbf{x}$, and understand how the dot product changes when we
change $\mathbf{x}$.


A bit of notation that will be useful when working with matrix derivatives in ML is called the de- 
nominator layout matrix derivative where we assemble our partial derivatives into the shape of 
whatever vector, matrix, or tensor is in the denominator of the differential. In this case, we will 
write 


$$\frac{df}{d\mathbf{x}} = \begin{bmatrix} \frac{df}{dx_1} \\ \vdots \\ \frac{df}{dx_n} \end{bmatrix}, \tag{18.4.34}$$
where we matched the shape of the column vector x. 
If we write out our function into components this is 
$$f(\mathbf{x}) = \sum_{i = 1}^{n} \beta_ix_i = \beta_1x_1 + \cdots + \beta_nx_n. \tag{18.4.35}$$


If we now take the partial derivative with respect to say $x_1$, note that everything is zero but the first
term, which is just $x_1$ multiplied by $\beta_1$, so we obtain that

$$\frac{df}{dx_1} = \beta_1, \tag{18.4.36}$$

or more generally that

$$\frac{df}{dx_i} = \beta_i. \tag{18.4.37}$$

We can now reassemble this into a matrix to see

$$\frac{df}{d\mathbf{x}} = \begin{bmatrix} \frac{df}{dx_1} \\ \vdots \\ \frac{df}{dx_n} \end{bmatrix} = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_n \end{bmatrix} = \boldsymbol{\beta}. \tag{18.4.38}$$


This illustrates a few factors about matrix calculus that we will often encounter throughout this
section:


First, the computations will get rather involved.


Second, the final results are much cleaner than the intermediate process, and will always
look similar to the single variable case. In this case, note that $\frac{d}{dx}(bx) = b$ and $\frac{d}{d\mathbf{x}}(\boldsymbol{\beta}^\top\mathbf{x}) = \boldsymbol{\beta}$
are both similar.


Third, transposes can often appear seemingly from nowhere. The core reason for this is the
convention that we match the shape of the denominator, thus when we multiply matrices,
we will need to take transposes to match back to the shape of the original term.
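The result in (18.4.38) is easy to confirm numerically. Below is a minimal sketch in plain Python; the particular values of `beta` and `x` and the step size `h` are made up for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

beta = [2.0, -1.0, 0.5]   # illustrative values, not from the text
x = [0.3, 1.7, -2.0]

def f(x):
    return dot(beta, x)

h = 1e-6
grad = []
for k in range(len(x)):
    shifted = list(x)
    shifted[k] += h                       # perturb only the k-th input
    grad.append((f(shifted) - f(x)) / h)  # finite-difference partial

print(grad)  # approximately equal to beta
```

Because `f` is linear in `x`, the finite difference recovers `beta` up to rounding, regardless of the point `x` chosen.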


To keep building intuition, let us try a computation that is a little harder. Suppose that we have a 
column vector x, and a square matrix A and we want to compute 


$$\frac{d}{d\mathbf{x}}(\mathbf{x}^\top \mathbf{A} \mathbf{x}). \tag{18.4.39}$$


To drive towards easier to manipulate notation, let us consider this problem using Einstein nota- 
tion. In this case we can write the function as 


$$\mathbf{x}^\top \mathbf{A} \mathbf{x} = x_ia_{ij}x_j. \tag{18.4.40}$$


To compute our derivative, we need to understand for every $k$, what is the value of

$$\frac{d}{dx_k}(\mathbf{x}^\top \mathbf{A} \mathbf{x}) = \frac{d}{dx_k}x_ia_{ij}x_j. \tag{18.4.41}$$

By the product rule, this is

$$\frac{d}{dx_k}x_ia_{ij}x_j = \frac{dx_i}{dx_k}a_{ij}x_j + x_ia_{ij}\frac{dx_j}{dx_k}. \tag{18.4.42}$$

For a term like $\frac{dx_i}{dx_k}$, it is not hard to see that this is one when $i = k$ and zero otherwise. This means
that every term where $i$ and $k$ are different vanishes from this sum, so the only terms that remain in
that first sum are the ones where $i = k$. The same reasoning holds for the second term where we
need $j = k$. This gives

$$\frac{d}{dx_k}x_ia_{ij}x_j = a_{kj}x_j + x_ia_{ik}. \tag{18.4.43}$$

Now, the names of the indices in Einstein notation are arbitrary; the fact that $i$ and $j$ are different
is immaterial to this computation at this point, so we can re-index so that they both use $i$ to see
that

$$\frac{d}{dx_k}x_ia_{ij}x_j = a_{ki}x_i + x_ia_{ik} = (a_{ki} + a_{ik})x_i. \tag{18.4.44}$$

Now, here is where we start to need some practice to go further. Let us try and identify this outcome
in terms of matrix operations. $a_{ki} + a_{ik}$ is the $k, i$-th component of $\mathbf{A} + \mathbf{A}^\top$. This gives

$$\frac{d}{dx_k}x_ia_{ij}x_j = [\mathbf{A} + \mathbf{A}^\top]_{ki}x_i. \tag{18.4.45}$$

Similarly, this term is now the product of the matrix $\mathbf{A} + \mathbf{A}^\top$ by the vector $\mathbf{x}$, so we see that

$$\left[\frac{d}{d\mathbf{x}}(\mathbf{x}^\top \mathbf{A} \mathbf{x})\right]_k = \frac{d}{dx_k}x_ia_{ij}x_j = [(\mathbf{A} + \mathbf{A}^\top)\mathbf{x}]_k. \tag{18.4.46}$$

Thus, we see that the $k$-th entry of the desired derivative from (18.4.39) is just the $k$-th entry of the
vector on the right, and thus the two are the same. This yields

$$\frac{d}{d\mathbf{x}}(\mathbf{x}^\top \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}. \tag{18.4.47}$$
This required significantly more work than our last one, but the final result is small. More than
that, consider the following computation for traditional single variable derivatives:

$$\frac{d}{dx}(xax) = \frac{dx}{dx}ax + xa\frac{dx}{dx} = (a + a)x. \tag{18.4.48}$$

Equivalently $\frac{d}{dx}(ax^{2}) = 2ax = (a + a)x$. Again, we get a result that looks rather like the single
variable result but with a transpose tossed in.
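The formula (18.4.47) can be checked against finite differences. The sketch below uses plain Python lists; the non-symmetric matrix `A`, the vector `x`, and the step size `h` are illustrative assumptions.

```python
A = [[1.0, 2.0, 0.0],
     [-1.0, 3.0, 4.0],
     [0.5, 0.0, -2.0]]   # deliberately non-symmetric, illustrative values
x = [1.0, -2.0, 0.5]
n = len(x)

def quadratic_form(x):
    # x^T A x written out with explicit index sums
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

# Closed form from (18.4.47): the k-th entry of (A + A^T) x
closed_form = [sum((A[k][i] + A[i][k]) * x[i] for i in range(n))
               for k in range(n)]

h = 1e-6
numeric = []
for k in range(n):
    up, down = list(x), list(x)
    up[k] += h
    down[k] -= h
    numeric.append((quadratic_form(up) - quadratic_form(down)) / (2 * h))

print(closed_form)
print(numeric)  # should agree with closed_form up to rounding
```

Using a non-symmetric `A` matters here: for a symmetric matrix the answer collapses to `2 A x`, which would hide the role of the transpose.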


At this point, the pattern should be looking rather suspicious, so let us try to figure out why. When
we take matrix derivatives like this, let us first assume that the expression we get will be another
matrix expression: an expression we can write in terms of products and sums of matrices and
their transposes. If such an expression exists, it will need to be true for all matrices. In particular,
it will need to be true of 1 x 1 matrices, in which case the matrix product is just the product of the 
numbers, the matrix sum is just the sum, and the transpose does nothing at all! In other words, 
whatever expression we get must match the single variable expression. This means that, with 
some practice, one can often guess matrix derivatives just by knowing what the associated single 
variable expression must look like! 


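Let us try this out. Suppose that $\mathbf{X}$ is an $n \times m$ matrix, $\mathbf{U}$ is an $n \times r$ and $\mathbf{V}$ is an $r \times m$. Let us try to
compute

$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = \;? \tag{18.4.49}$$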


This computation is important in an area called matrix factorization. For us, however, it is just a
derivative to compute. Let us try to imagine what this would be for $1 \times 1$ matrices. In that case,
we get the expression

$$\frac{d}{dv}(x - uv)^{2} = -2(x - uv)u, \tag{18.4.50}$$

where the derivative is rather standard. If we try to convert this back into a matrix expression we
get

$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2(\mathbf{X} - \mathbf{U}\mathbf{V})\mathbf{U}. \tag{18.4.51}$$


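However, if we look at this it does not quite work. Recall that $\mathbf{X}$ is $n \times m$, as is $\mathbf{U}\mathbf{V}$, so the matrix
$2(\mathbf{X} - \mathbf{U}\mathbf{V})$ is $n \times m$. On the other hand $\mathbf{U}$ is $n \times r$, and we cannot multiply an $n \times m$ and an $n \times r$
matrix since the dimensions do not match!

We want to get $\frac{d}{d\mathbf{V}}$, which is the same shape as $\mathbf{V}$, which is $r \times m$. So somehow we need to take an
$n \times m$ matrix and an $n \times r$ matrix, multiply them together (perhaps with some transposes) to get an
$r \times m$ matrix. We can do this by multiplying $\mathbf{U}^\top$ by $(\mathbf{X} - \mathbf{U}\mathbf{V})$. Thus, we can guess the solution to (18.4.49)
is

$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2\mathbf{U}^\top(\mathbf{X} - \mathbf{U}\mathbf{V}). \tag{18.4.52}$$

To show that this works, we would be remiss to not provide a detailed computation. If we already
believe that this rule-of-thumb works, feel free to skip past this derivation. To compute

$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2}, \tag{18.4.53}$$

we must find for every $a$, and $b$

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = \frac{d}{dv_{ab}} \sum_{i, j}\left(x_{ij} - \sum_k u_{ik}v_{kj}\right)^{2}. \tag{18.4.54}$$

Recalling that all entries of $\mathbf{X}$ and $\mathbf{U}$ are constants as far as $\frac{d}{dv_{ab}}$ is concerned, we may push the
derivative inside the sum, and apply the chain rule to the square to get

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = \sum_{i, j} 2\left(x_{ij} - \sum_k u_{ik}v_{kj}\right)\left(-\sum_k u_{ik}\frac{dv_{kj}}{dv_{ab}}\right). \tag{18.4.55}$$

As in the previous derivation, we may note that $\frac{dv_{kj}}{dv_{ab}}$ is only non-zero if $k = a$ and $j = b$. If
either of those conditions does not hold, the term in the sum is zero, and we may freely discard it.
We see that

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2\sum_{i}\left(x_{ib} - \sum_k u_{ik}v_{kb}\right)u_{ia}. \tag{18.4.56}$$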
An important subtlety here is that the requirement that $k = a$ does not occur inside the inner
sum since that $k$ is a dummy variable which we are summing over inside the inner term. For a
notationally cleaner example, consider why

$$\frac{d}{dx_1}\left(\sum_i x_i\right)^{2} = 2\left(\sum_i x_i\right). \tag{18.4.57}$$

From this point, we may start identifying components of the sum. First,

$$\sum_k u_{ik}v_{kb} = [\mathbf{U}\mathbf{V}]_{ib}. \tag{18.4.58}$$

So the entire expression in the inside of the sum is

$$x_{ib} - \sum_k u_{ik}v_{kb} = [\mathbf{X} - \mathbf{U}\mathbf{V}]_{ib}. \tag{18.4.59}$$

This means we may now write our derivative as

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2\sum_{i}[\mathbf{X} - \mathbf{U}\mathbf{V}]_{ib}u_{ia}. \tag{18.4.60}$$

We want this to look like the $a, b$ element of a matrix so we can use the technique as in the previous
example to arrive at a matrix expression, which means that we need to exchange the order of the
indices on $u_{ia}$. If we notice that $u_{ia} = [\mathbf{U}^\top]_{ai}$, we can then write

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2\sum_{i}[\mathbf{U}^\top]_{ai}[\mathbf{X} - \mathbf{U}\mathbf{V}]_{ib}. \tag{18.4.61}$$

This is a matrix product, and thus we can conclude that

$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2[\mathbf{U}^\top(\mathbf{X} - \mathbf{U}\mathbf{V})]_{ab}, \tag{18.4.62}$$

and thus we may write the solution to (18.4.49)

$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = -2\mathbf{U}^\top(\mathbf{X} - \mathbf{U}\mathbf{V}). \tag{18.4.63}$$


This matches the solution we guessed above! 
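For readers who skipped the derivation, a numerical comparison gives some reassurance. The sketch below is written in plain Python with small made-up matrices; the sizes, values, and helper names (`matmul`, `transpose`, `loss`) are assumptions for illustration only.

```python
# Illustrative sizes and values (assumptions, not from the text)
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # n x m with n = 2, m = 3
U = [[0.5], [-1.0]]            # n x r with r = 1
V = [[1.0, 0.0, -2.0]]         # r x m

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(row) for row in zip(*P)]

def loss(V):
    # Sum of squared entries of X - UV
    UV = matmul(U, V)
    return sum((X[i][j] - UV[i][j])**2
               for i in range(len(X)) for j in range(len(X[0])))

# Closed form from (18.4.63): -2 U^T (X - UV)
UV = matmul(U, V)
residual = [[X[i][j] - UV[i][j] for j in range(len(X[0]))]
            for i in range(len(X))]
closed_form = [[-2 * entry for entry in row]
               for row in matmul(transpose(U), residual)]

# Central finite differences, one entry v_ab at a time
h = 1e-6
numeric = [[0.0] * len(V[0]) for _ in V]
for a in range(len(V)):
    for b in range(len(V[0])):
        up = [row[:] for row in V]
        down = [row[:] for row in V]
        up[a][b] += h
        down[a][b] -= h
        numeric[a][b] = (loss(up) - loss(down)) / (2 * h)

print(closed_form)
print(numeric)  # same shape as V, and close to closed_form
```

Note that the result has the shape of `V`, as the denominator layout convention requires.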


It is reasonable to ask at this point, “Why can I not just write down matrix versions of all the cal- 
culus rules I have learned? It is clear this is still mechanical. Why do we not just get it over with!” 
And indeed there are such rules and (Petersen et al., 2008) provides an excellent summary. How- 
ever, due to the plethora of ways matrix operations can be combined compared to single values, 
there are many more matrix derivative rules than single variable ones. It is often the case that it 
is best to work with the indices, or leave it up to automatic differentiation when appropriate. 


Summary 


• In higher dimensions, we can define gradients which serve the same purpose as derivatives
in one dimension. These allow us to see how a multi-variable function changes when we
make an arbitrary small change to the inputs.


• The backpropagation algorithm can be seen to be a method of organizing the multi-variable
chain rule to allow for the efficient computation of many partial derivatives.


• Matrix calculus allows us to write the derivatives of matrix expressions in concise ways.


Exercises 


1. Given a column vector $\boldsymbol{\beta}$, compute the derivatives of both $f(\mathbf{x}) = \boldsymbol{\beta}^\top\mathbf{x}$ and $g(\mathbf{x}) = \mathbf{x}^\top\boldsymbol{\beta}$. Why
do you get the same answer?


2. Let $\mathbf{v}$ be an $n$ dimension vector. What is $\frac{\partial}{\partial\mathbf{v}}\|\mathbf{v}\|_2$?


3. Let $L(x, y) = \log(e^x + e^y)$. Compute the gradient. What is the sum of the components of the
gradient?


4. Let $f(x, y) = x^2y + xy^2$. Show that the only critical point is $(0,0)$. By considering $f(x, x)$,
determine if $(0,0)$ is a maximum, minimum, or neither.


5. Suppose that we are minimizing a function $f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})$. How can we geometrically
interpret the condition of $\nabla f = 0$ in terms of $g$ and $h$?


Discussions [24]


[24] https://discuss.d2l.ai/t/413


18.5 Integral Calculus


Differentiation only makes up half of the content of a traditional calculus education. The other
pillar, integration, starts out seeming a rather disjoint question, "What is the area underneath this
curve?" While seemingly unrelated, integration is tightly intertwined with differentiation via
what is known as the fundamental theorem of calculus.


At the level of machine learning we discuss in this book, we will not need a deep understanding of
integration. However, we will provide a brief introduction to lay the groundwork for any further 
applications we will encounter later on. 


18.5.1 Geometric Interpretation 


Suppose that we have a function f(x). For simplicity, let us assume that f(x) is non-negative (never 
takes a value less than zero). What we want to try and understand is: what is the area contained 
between f(x) and the x-axis? 


%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mpl_toolkits import mplot3d
from mxnet import np, npx
npx.set_np()

x = np.arange(-2, 2, 0.01)
f = np.exp(-x**2)

d2l.set_figsize()
d2l.plt.plot(x, f, color='black')
d2l.plt.fill_between(x.tolist(), f.tolist())
d2l.plt.show()


In most cases, this area will be infinite or undefined (consider the area under $f(x) = x^{2}$), so people
will often talk about the area between a pair of ends, say $a$ and $b$.


x = np.arange(-2, 2, 0.01)
f = np.exp(-x**2)

d2l.set_figsize()
d2l.plt.plot(x, f, color='black')
d2l.plt.fill_between(x.tolist()[50:250], f.tolist()[50:250])
d2l.plt.show()


We will denote this area by the integral symbol below:

$$\mathrm{Area}(\mathcal{A}) = \int_a^b f(x) \;dx. \tag{18.5.1}$$

The inner variable is a dummy variable, much like the index of a sum in a $\sum$, and so this can be
equivalently written with any inner value we like:

$$\int_a^b f(x) \;dx = \int_a^b f(z) \;dz. \tag{18.5.2}$$


There is a traditional way to try and understand how we might try to approximate such integrals: 
we can imagine taking the region in-between a and b and chopping it into N vertical slices. If N 
is large, we can approximate the area of each slice by a rectangle, and then add up the areas to get 
the total area under the curve. Let us take a look at an example doing this in code. We will see 
how to get the true value in a later section. 


epsilon = 0.05
a = 0
b = 2

x = np.arange(a, b, epsilon)
f = x / (1 + x**2)

approx = np.sum(epsilon*f)
true = np.log(5) / 2  # exact value: an antiderivative is log(1 + x**2) / 2

d2l.set_figsize()
d2l.plt.bar(x.asnumpy(), f.asnumpy(), width=epsilon, align='edge')
d2l.plt.plot(x, f, color='black')
d2l.plt.ylim([0, 1])
d2l.plt.show()

f'approximation: {approx}, truth: {float(true):.4f}'

'approximation: 0.7944855690002441, truth: 0.8047'


The issue is that while it can be done numerically, we can do this approach analytically for only
the simplest functions like

$$\int_a^b x \;dx. \tag{18.5.3}$$

Anything somewhat more complex like our example from the code above

$$\int_a^b \frac{x}{1+x^{2}} \;dx \tag{18.5.4}$$

is beyond what we can solve with such a direct method.
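Numerically, though, the chop-and-sum recipe is easy to refine by shrinking the slice width. The following sketch in plain Python (not MXNet) shows the error falling as `epsilon` decreases; the helper name `riemann_sum` is an assumption, and the exact value is taken from the antiderivative log(1 + x^2) / 2.

```python
import math

def riemann_sum(f, a, b, epsilon):
    """Left-endpoint rectangle rule with slices of width `epsilon`."""
    n = round((b - a) / epsilon)
    return sum(f(a + i * epsilon) * epsilon for i in range(n))

def f(x):
    return x / (1 + x**2)

exact = math.log(5) / 2   # value of the integral of f from 0 to 2
errors = []
for epsilon in [0.05, 0.005, 0.0005]:
    approx = riemann_sum(f, 0, 2, epsilon)
    errors.append(exact - approx)
    print(f'epsilon {epsilon}: approx {approx:.6f}, error {errors[-1]:.2e}')
```

For this rule the error shrinks roughly in proportion to `epsilon`, so each extra digit of accuracy costs about ten times as many slices.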


We will instead take a different approach. We will work intuitively with the notion of the area, and 
learn the main computational tool used to find integrals: the fundamental theorem of calculus. This 
will be the basis for our study of integration. 


18.5.2 The Fundamental Theorem of Calculus 
To dive deeper into the theory of integration, let us introduce a function

$$F(x) = \int_0^x f(y) \;dy. \tag{18.5.5}$$

This function measures the area between 0 and $x$ depending on how we change $x$. Notice that this
is everything we need since

$$\int_a^b f(x) \;dx = F(b) - F(a). \tag{18.5.6}$$


This is a mathematical encoding of the fact that we can measure the area out to the far end-point
and then subtract off the area to the near end point as indicated in Fig. 18.5.1.


Fig. 18.5.1: Visualizing why we may reduce the problem of computing the area under a curve 
between two points to computing the area to the left of a point. 


Thus, we can figure out what the integral over any interval is by figuring out what F (x) is. 


To do so, let us consider an experiment. As we often do in calculus, let us imagine what happens 
when we shift the value by a tiny bit. From the comment above, we know that 


$$F(x+\epsilon) - F(x) = \int_x^{x+\epsilon} f(y) \;dy. \tag{18.5.7}$$


This tells us that the function changes by the area under a tiny sliver of a function. 


This is the point at which we make an approximation. If we look at a tiny sliver of area like this, it
looks like this area is close to the rectangular area with height the value of $f(x)$ and the base width
$\epsilon$. Indeed, one can show that as $\epsilon \rightarrow 0$ this approximation becomes better and better. Thus we can
conclude:

$$F(x+\epsilon) - F(x) \approx \epsilon f(x). \tag{18.5.8}$$


However, we can now notice: this is exactly the pattern we expect if we were computing the deriva- 
tive of F! Thus we see the following rather surprising fact: 


$$\frac{dF}{dx}(x) = f(x). \tag{18.5.9}$$

This is the fundamental theorem of calculus. We may write it in expanded form as

$$\frac{d}{dx}\int_0^x f(y) \;dy = f(x). \tag{18.5.10}$$


It takes the concept of finding areas (a priori rather hard), and reduces it to a statement about derivatives
(something much more completely understood). One last comment that we must make is that this
does not tell us exactly what $F(x)$ is. Indeed $F(x) + C$ for any $C$ has the same derivative. This
is a fact-of-life in the theory of integration. Thankfully, notice that when working with definite
integrals, the constants drop out, and thus are irrelevant to the outcome.

$$\int_a^b f(x) \;dx = (F(b) + C) - (F(a) + C) = F(b) - F(a). \tag{18.5.11}$$


This may seem like abstract nonsense, but let us take a moment to appreciate that it has given us
a whole new perspective on computing integrals. Our goal is no longer to do some sort of
chop-and-sum process to try and recover the area, rather we need only find a function whose derivative
is the function we have! This is incredible since we can now list many rather difficult integrals
by just reversing the table from Section 18.3.2. For instance, we know that the derivative of $x^{n}$ is
$nx^{n-1}$. Thus, we can say using the fundamental theorem (18.5.10) that

$$\int_0^{x} ny^{n-1} \;dy = x^{n} - 0^{n} = x^{n}. \tag{18.5.12}$$

Similarly, we know that the derivative of $e^{x}$ is itself, so that means

$$\int_0^{x} e^{y} \;dy = e^{x} - e^{0} = e^{x} - 1. \tag{18.5.13}$$

In this way, we can develop the entire theory of integration leveraging ideas from differential 
calculus freely. Every integration rule derives from this one fact. 
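The fundamental theorem can also be observed numerically: differentiating a numerically computed area function should give back the integrand. This is a minimal sketch in plain Python; the choice of integrand, the midpoint rule, and the parameters `n` and `eps` are assumptions for illustration.

```python
import math

def f(y):
    return math.exp(-y**2)

def F(x, n=20000):
    """Midpoint-rule approximation of the area under f between 0 and x."""
    width = x / n
    return sum(f((i + 0.5) * width) for i in range(n)) * width

x, eps = 0.8, 1e-3
# Slope of the area function F at x, by a central difference
slope = (F(x + eps) - F(x - eps)) / (2 * eps)
print(slope, f(x))  # the two values should nearly agree
```

No antiderivative of this particular `f` can be written with elementary functions, yet the derivative of its area function is still just `f`.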


18.5.3 Change of Variables 


Just as with differentiation, there are a number of rules which make the computation of integrals 
more tractable. In fact, every rule of differential calculus (like the product rule, sum rule, and 
chain rule) has a corresponding rule for integral calculus (integration by parts, linearity of inte- 
gration, and the change of variables formula respectively). In this section, we will dive into what 
is arguably the most important from the list: the change of variables formula. 


First, suppose that we have a function which is itself an integral:

$$F(x) = \int_0^x f(y) \;dy. \tag{18.5.14}$$

Let us suppose that we want to know how this function looks when we compose it with another to
obtain $F(u(x))$. By the chain rule, we know

$$\frac{d}{dx}F(u(x)) = \frac{dF}{du}(u(x)) \cdot \frac{du}{dx}. \tag{18.5.15}$$

We can turn this into a statement about integration by using the fundamental theorem (18.5.10)
as above. This gives

$$F(u(x)) - F(u(0)) = \int_0^x \frac{dF}{du}(u(y)) \cdot \frac{du}{dy} \;dy. \tag{18.5.16}$$

Recalling that $F$ is itself an integral gives that the left hand side may be rewritten to be

$$\int_{u(0)}^{u(x)} f(y) \;dy = \int_0^x \frac{dF}{du}(u(y)) \cdot \frac{du}{dy} \;dy. \tag{18.5.17}$$

Similarly, recalling that $F$ is an integral allows us to recognize that $\frac{dF}{dx} = f$ using the fundamental
theorem (18.5.10), and thus we may conclude

$$\int_{u(0)}^{u(x)} f(y) \;dy = \int_0^x f(u(y)) \cdot \frac{du}{dy} \;dy. \tag{18.5.18}$$


This is the change of variables formula. 


For a more intuitive derivation, consider what happens when we take an integral of $f(u(x))$ between
$x$ and $x+\epsilon$. For a small $\epsilon$, this integral is approximately $\epsilon f(u(x))$, the area of the associated
rectangle. Now, let us compare this with the integral of $f(y)$ from $u(x)$ to $u(x+\epsilon)$. We know that
$u(x+\epsilon) \approx u(x) + \epsilon\frac{du}{dx}(x)$, so the area of this rectangle is approximately $\epsilon\frac{du}{dx}(x)f(u(x))$. Thus, to
make the area of these two rectangles agree, we need to multiply the first one by $\frac{du}{dx}(x)$ as is
illustrated in Fig. 18.5.2.

Fig. 18.5.2: Visualizing the transformation of a single thin rectangle under the change of variables.


This tells us that

$$\int_x^{x+\epsilon} f(u(y))\frac{du}{dy}(y) \;dy = \int_{u(x)}^{u(x+\epsilon)} f(y) \;dy. \tag{18.5.19}$$
This is the change of variables formula expressed for a single small rectangle. 


If $u(x)$ and $f(x)$ are properly chosen, this can allow for the computation of incredibly complex
integrals. For instance, if we even chose $f(y) = 1$ and $u(x) = e^{-x^{2}}$ (which means $\frac{du}{dx}(x) = -2xe^{-x^{2}}$),
this can show for instance that

$$e^{-1} - 1 = \int_{e^{-0}}^{e^{-1}} 1 \;dy = -2\int_0^{1} ye^{-y^{2}} \;dy, \tag{18.5.20}$$

and thus by rearranging that

$$\int_0^{1} ye^{-y^{2}} \;dy = \frac{1 - e^{-1}}{2}. \tag{18.5.21}$$
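Both sides of this worked example can be confirmed with a simple numerical integrator. The sketch below uses plain Python and a midpoint rule; the helper name `midpoint_integral` and the slice count `n` are assumptions made for illustration.

```python
import math

def midpoint_integral(f, a, b, n=10000):
    """Midpoint rule; the slice width is negative when b < a."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

# Right-hand side of the rearranged identity: (1 - e^{-1}) / 2
lhs = midpoint_integral(lambda y: y * math.exp(-y**2), 0, 1)
rhs = (1 - math.exp(-1)) / 2
print(lhs, rhs)

# The integral of 1 from e^{-0} down to e^{-1} equals e^{-1} - 1
constant_side = midpoint_integral(lambda y: 1.0, math.exp(0), math.exp(-1))
print(constant_side, -2 * lhs)
```

The second comparison integrates from a larger endpoint down to a smaller one, which is why its value is negative.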


18.5.4 A Comment on Sign Conventions 


Keen-eyed readers will observe something strange about the computations above. Namely, com- 
putations like 


$$\int_{e^{-0}}^{e^{-1}} 1 \;dy = e^{-1} - 1 < 0, \tag{18.5.22}$$


can produce negative numbers. When thinking about areas, it can be strange to see a negative 
value, and so it is worth digging into what the convention is. 


Mathematicians take the notion of signed areas. This manifests itself in two ways. First, if we 
consider a function f(x) which is sometimes less than zero, then the area will also be negative. So 
for instance 


$$\int_0^{1} (-1) \;dx = -1. \tag{18.5.23}$$


Similarly, integrals which progress from right to left, rather than left to right are also taken to be 
negative areas 


$$\int_0^{-1} 1 \;dx = -1. \tag{18.5.24}$$


The standard area (from left to right of a positive function) is always positive. Anything obtained
by flipping it (say flipping over the x-axis to get the integral of a negative number, or flipping over
the y-axis to get an integral in the wrong order) will produce a negative area. And indeed, flipping
twice will give a pair of negative signs that cancel out to have positive area

$$\int_0^{-1} (-1) \;dx = 1. \tag{18.5.25}$$

If this discussion sounds familiar, it is! In Section 18.1 we discussed how the determinant repre- 
sented the signed area in much the same way. 
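The sign conventions fall out naturally from a rectangle sum in which the slice width carries the direction of travel. A minimal sketch in plain Python, with the helper name `signed_integral` assumed for illustration:

```python
def signed_integral(f, a, b, n=1000):
    """Midpoint rule; the slice width (b - a) / n is negative when b < a."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

negative_function = signed_integral(lambda x: -1.0, 0, 1)   # flipped over the x-axis
right_to_left = signed_integral(lambda x: 1.0, 0, -1)       # integrated in reverse order
both_flips = signed_integral(lambda x: -1.0, 0, -1)         # the two signs cancel

print(negative_function, right_to_left, both_flips)
```

Each flip contributes one factor of -1, either through the heights of the rectangles or through their widths.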


18.5.5 Multiple Integrals 


In some cases, we will need to work in higher dimensions. For instance, suppose that we have 
a function of two variables, like f(x,y) and we want to know the volume under f when x ranges 
over $[a, b]$ and y ranges over $[c, d]$.


# Construct grid and compute function
x, y = np.meshgrid(np.linspace(-2, 2, 101), np.linspace(-2, 2, 101),
                   indexing='ij')
z = np.exp(- x**2 - y**2)

# Plot function
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y')
d2l.plt.xticks([-2, -1, 0, 1, 2])
d2l.plt.yticks([-2, -1, 0, 1, 2])
d2l.set_figsize()
ax.set_xlim(-2, 2)
ax.set_ylim(-2, 2)
ax.set_zlim(0, 1)
ax.dist = 12





We write this as

$$\int_{[a,b]\times[c,d]} f(x, y) \;dx\;dy. \tag{18.5.26}$$

Suppose that we wish to compute this integral. My claim is that we can do this by iteratively
computing first the integral in $x$ and then shifting to the integral in $y$, that is to say

$$\int_{[a,b]\times[c,d]} f(x, y) \;dx\;dy = \int_c^{d}\left(\int_a^{b} f(x, y) \;dx\right) dy. \tag{18.5.27}$$

Let us see why this is.

Consider the figure above where we have split the function into $\epsilon \times \epsilon$ squares which we will index
with integer coordinates $i, j$. In this case, our integral is approximately

$$\sum_{i, j} \epsilon^{2} f(\epsilon i, \epsilon j). \tag{18.5.28}$$
ij 


Once we discretize the problem, we may add up the values on these squares in whatever order we 
like, and not worry about changing the values. This is illustrated in Fig. 18.5.3. In particular, we 
can say that 


$$\sum_{j} \epsilon\left(\sum_{i} \epsilon f(\epsilon i, \epsilon j)\right). \tag{18.5.29}$$





Fig. 18.5.3: Illustrating how to decompose a sum over many squares as a sum over first the columns 
(1), then adding the column sums together (2). 


The sum on the inside is precisely the discretization of the integral 


$$G(\epsilon j) = \int_a^{b} f(x, \epsilon j) \;dx. \tag{18.5.30}$$

Finally, notice that if we combine these two expressions we get

$$\sum_{j} \epsilon G(\epsilon j) \approx \int_c^{d} G(y) \;dy = \int_{[a,b]\times[c,d]} f(x, y) \;dx\;dy. \tag{18.5.31}$$


Thus putting it all together, we have that 


$$\int_{[a,b]\times[c,d]} f(x, y) \;dx\;dy = \int_c^{d}\left(\int_a^{b} f(x, y) \;dx\right) dy. \tag{18.5.32}$$


Notice that, once discretized, all we did was rearrange the order in which we added a list of numbers.
This may make it seem like it is nothing, however this result (called Fubini's Theorem) is not
always true! For the type of mathematics encountered when doing machine learning (continuous
functions), there is no concern, however it is possible to create examples where it fails (for
example the function $f(x, y) = xy(x^2 - y^2)/(x^2 + y^2)^3$ over the rectangle $[0, 2] \times [0, 1]$).


Note that the choice to do the integral in $x$ first, and then the integral in $y$ was arbitrary. We could
have equally well chosen to do $y$ first and then $x$ to see

$$\int_{[a,b]\times[c,d]} f(x, y) \;dx\;dy = \int_a^{b}\left(\int_c^{d} f(x, y) \;dy\right) dx. \tag{18.5.33}$$

Often times, we will condense down to vector notation, and say that for $U = [a, b] \times [c, d]$ this is

$$\int_U f(\mathbf{x}) \;d\mathbf{x}. \tag{18.5.34}$$
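For a continuous integrand, the two orders of iterated integration can be compared directly. The sketch below uses plain Python with a midpoint rule; the rectangle, the slice count, and the helper name `midpoint_integral` are illustrative assumptions.

```python
import math

def midpoint_integral(f, a, b, n=200):
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

def f(x, y):
    return math.exp(-x**2 - y**2)

a, b, c, d = -2, 2, -1, 1   # illustrative rectangle [a, b] x [c, d]

# Integrate in x first, then in y
x_then_y = midpoint_integral(
    lambda y: midpoint_integral(lambda x: f(x, y), a, b), c, d)
# Integrate in y first, then in x
y_then_x = midpoint_integral(
    lambda x: midpoint_integral(lambda y: f(x, y), c, d), a, b)

print(x_then_y, y_then_x)  # the two orders should agree
```

Once discretized, both computations add up the same grid of values, only in a different order, which mirrors the argument given above.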


18.5.6 Change of Variables in Multiple Integrals 


As with single variables in (18.5.18), the ability to change variables inside a higher dimensional 
integral is a key tool. Let us summarize the result without derivation. 


We need a function that reparameterizes our domain of integration. We can take this to be $\phi :
\mathbb{R}^n \rightarrow \mathbb{R}^n$, that is any function which takes in $n$ real variables and returns another $n$. To keep
the expressions clean, we will assume that $\phi$ is injective which is to say it never folds over itself
($\phi(\mathbf{x}) = \phi(\mathbf{y}) \implies \mathbf{x} = \mathbf{y}$).


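In this case, we can say that

$$\int_{\phi(U)} f(\mathbf{x}) \;d\mathbf{x} = \int_{U} f(\phi(\mathbf{x})) \left|\det(D\phi(\mathbf{x}))\right| \;d\mathbf{x}, \tag{18.5.35}$$

where $D\phi$ is the Jacobian of $\phi$, which is the matrix of partial derivatives of
$\phi = (\phi_1(x_1, \ldots, x_n), \ldots, \phi_n(x_1, \ldots, x_n))$,

$$D\phi = \begin{bmatrix} \frac{\partial \phi_1}{\partial x_1} & \cdots & \frac{\partial \phi_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial \phi_n}{\partial x_1} & \cdots & \frac{\partial \phi_n}{\partial x_n} \end{bmatrix}. \tag{18.5.36}$$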
0x1 Orn 


Looking closely, we see that this is similar to the single variable chain rule (18.5.18), except we
have replaced the term $\frac{du}{dx}(x)$ with $\left|\det(D\phi(\mathbf{x}))\right|$. Let us see how we can interpret this term.
Recall that the $\frac{du}{dx}(x)$ term existed to say how much we stretched our x-axis by applying $u$. The
same process in higher dimensions is to determine how much we stretch the area (or volume, or
hyper-volume) of a little square (or little hyper-cube) by applying $\phi$. If $\phi$ was the multiplication by
a matrix, then we know how the determinant already gives the answer.


With some work, one can show that the Jacobian provides the best approximation to a multivariable
function $\phi$ at a point by a matrix in the same way we could approximate by lines or planes
with derivatives and gradients. Thus the determinant of the Jacobian exactly mirrors the scaling
factor we identified in one dimension.


It takes some work to fill in the details to this, so do not worry if they are not clear now. Let us see 
at least one example we will make use of later on. Consider the integral 


$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-x^{2}-y^{2}} \;dx\;dy. \tag{18.5.37}$$


Playing with this integral directly will get us nowhere, but if we change variables, we can make
significant progress. If we let $\phi(r, \theta) = (r\cos(\theta), r\sin(\theta))$ (which is to say that $x = r\cos(\theta)$, $y =
r\sin(\theta)$), then we can apply the change of variable formula to see that this is the same thing as

$$\int_0^{\infty} \int_0^{2\pi} e^{-r^{2}} \left|\det(D\phi(\mathbf{x}))\right| \;d\theta\;dr, \tag{18.5.38}$$

where

$$\left|\det(D\phi(\mathbf{x}))\right| = \left|\det\begin{bmatrix} \cos(\theta) & -r\sin(\theta) \\ \sin(\theta) & r\cos(\theta) \end{bmatrix}\right| = r(\cos^{2}(\theta) + \sin^{2}(\theta)) = r. \tag{18.5.39}$$

Thus, the integral is

$$\int_0^{\infty} \int_0^{2\pi} re^{-r^{2}} \;d\theta\;dr = 2\pi\int_0^{\infty} re^{-r^{2}} \;dr = \pi, \tag{18.5.40}$$

where the final equality follows by the same computation that we used in Section 18.5.3.


We will meet this integral again when we study continuous random variables in Section 18.6. 
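The value pi can be confirmed numerically in both coordinate systems. The sketch below is plain Python with a midpoint rule; the truncation radii (10 for the radial integral, 6 for the Cartesian one) and the helper name `midpoint_integral` are assumptions chosen so that the discarded tails are negligible.

```python
import math

def midpoint_integral(f, a, b, n=2000):
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

# Polar form: the angular integral contributes 2 * pi, and the radial
# integral is truncated at r = 10, where r * exp(-r^2) is negligible.
radial = midpoint_integral(lambda r: r * math.exp(-r**2), 0, 10)
polar_value = 2 * math.pi * radial

# Cartesian form on the square [-6, 6]^2 (also a truncation). The
# integrand factors as exp(-x^2) * exp(-y^2), so the iterated integral
# is the square of a one-dimensional integral.
one_dim = midpoint_integral(lambda x: math.exp(-x**2), -6, 6)
cartesian_value = one_dim**2

print(polar_value, cartesian_value, math.pi)
```

The Jacobian factor `r` is what makes the radial integrand easy to handle; without it, the polar form would not reproduce the Cartesian value.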


Summary 


• The theory of integration allows us to answer questions about areas or volumes.


• The fundamental theorem of calculus allows us to leverage knowledge about derivatives to
compute areas via the observation that the derivative of the area up to some point is given
by the value of the function being integrated.


• Integrals in higher dimensions can be computed by iterating single variable integrals.


Exercises 


1. What is $\int_1^2 \frac{1}{x} \;dx$?

2. Use the change of variables formula to integrate $\int_0^{\sqrt{\pi}} x\sin(x^2) \;dx$.

3. What is $\int_{[0,1]^2} xy \;dx\;dy$?

4. Use the change of variables formula to compute $\int_0^2\int_0^1 xy(x^2 - y^2)/(x^2 + y^2)^3 \;dy\;dx$ and
$\int_0^1\int_0^2 f(x, y) = xy(x^2 - y^2)/(x^2 + y^2)^3 \;dx\;dy$ to see they are different.


Discussions [25]


18.6 Random Variables 


In Section 2.6 we saw the basics of how to work with discrete random variables, which in our case 
refer to those random variables which take either a finite set of possible values, or the integers. 
In this section, we develop the theory of continuous random variables, which are random variables 
which can take on any real value. 


18.6.1 Continuous Random Variables 


Continuous random variables are a significantly more subtle topic than discrete random variables. 
A fair analogy to make is that the technical jump is comparable to the jump between adding lists 
of numbers and integrating functions. As such, we will need to take some time to develop the 
theory. 





25 https://discuss.d2l.ai/t/414
From Discrete to Continuous


To understand the additional technical challenges encountered when working with continuous
random variables, let us perform a thought experiment. Suppose that we are throwing a dart at
a dart board, and we want to know the probability that it hits exactly 2cm from the center of the
board.


To start with, we imagine measuring a single digit of accuracy, that is to say with bins for 0cm, 
1cm, 2cm, and so on. We throw say 100 darts at the dart board, and if 20 of them fall into the bin 
for 2cm we conclude that 20% of the darts we throw hit the board 2cm away from the center. 


However, when we look closer, this does not match our question! We wanted exact equality, 
whereas these bins hold all that fell between say 1.5cm and 2.5cm. 


Undeterred, we continue further. We measure even more precisely, say 1.9cm, 2.0cm, 2.1cm, and 
now see that perhaps 3 of the 100 darts hit the board in the 2.0cm bucket. Thus we conclude the 
probability is 3%. 


However, this does not solve anything! We have just pushed the issue down one digit further. Let
us abstract a bit. Imagine we know the probability that the first $k$ digits match with $2.00000\ldots$ and
we want to know the probability it matches for the first $k+1$ digits. It is fairly reasonable to assume
that the $(k+1)$-th digit is essentially a random choice from the set $\{0, 1, 2, \ldots, 9\}$. At least, we cannot
conceive of a physically meaningful process which would force the number of micrometers away
from the center to prefer to end in a 7 vs a 3.


What this means is that in essence each additional digit of accuracy we require should decrease
the probability of matching by a factor of 10. Or put another way, we would expect that

$$P(\text{distance is } 2.00\ldots, \text{ to } k \text{ digits}) \approx p \cdot 10^{-k}. \tag{18.6.1}$$

The value $p$ essentially encodes what happens with the first few digits, and the $10^{-k}$ handles the
rest.


Notice that if we know the position accurate to $k = 4$ digits after the decimal, that means we know
the value falls within the interval say $[1.99995, 2.00005]$, which is an interval of length
$2.00005 - 1.99995 = 10^{-4}$. Thus, if we call the length of this interval $\epsilon$, we can say

$$P(\text{distance is in an } \epsilon\text{-sized interval around } 2) \approx \epsilon \cdot p. \tag{18.6.2}$$


Let us take this one final step further. We have been thinking about the point 2 the entire time, 
but never thinking about other points. Nothing is different there fundamentally, but it is the case 
that the value p will likely be different. We would at least hope that a dart thrower was more likely 
to hit a point near the center, like 2cm rather than 20cm. Thus, the value p is not fixed, but rather 
should depend on the point x. This tells us that we should expect 


$$P(\text{distance is in an } \epsilon\text{-sized interval around } x) \approx \epsilon \cdot p(x). \tag{18.6.3}$$

Indeed, (18.6.3) precisely defines the probability density function. It is a function $p(x)$ which
encodes the relative probability of hitting near one point vs. another. Let us visualize what such a
function might look like.


%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mxnet import np, npx
npx.set_np()

# Plot the probability density function for some random variable
x = np.arange(-5, 5, 0.01)
p = 0.2*np.exp(-(x - 3)**2 / 2)/np.sqrt(2 * np.pi) + \
    0.8*np.exp(-(x + 1)**2 / 2)/np.sqrt(2 * np.pi)

d2l.plot(x, p, 'x', 'Density')


[Figure: the density p(x) plotted against x on [-5, 5]; x-axis "x", y-axis "Density".]


The locations where the function value is large indicate regions where we are more likely to find
the random value. The low portions are areas where we are unlikely to find the random value.
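To connect the picture back to (18.6.3), the sketch below draws samples from the same two-bump density and compares the fraction landing in a small interval around a point with $\epsilon \cdot p(x)$. It is a minimal illustration in plain NumPy (an assumption made for portability, instead of the MXNet `np` module); the sample size, the seed, and the test points are arbitrary choices.

```python
import numpy as np

def density(x):
    # The two-bump density from the plot: weights 0.2 and 0.8 on unit-variance
    # Gaussian bumps centred at 3 and -1
    return (0.2 * np.exp(-(x - 3)**2 / 2) +
            0.8 * np.exp(-(x + 1)**2 / 2)) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(0)
n = 1_000_000
# Pick the bump centred at 3 with probability 0.2, otherwise the one at -1,
# then add standard Gaussian noise around the chosen centre
centers = np.where(rng.random(n) < 0.2, 3.0, -1.0)
samples = centers + rng.standard_normal(n)

epsilon = 0.1
for x0 in [-1.0, 1.0, 3.0]:
    # Fraction of samples inside an epsilon-sized interval around x0
    frac = np.mean(np.abs(samples - x0) < epsilon / 2)
    print(f'x = {x0:+.1f}: empirical {frac:.4f}, '
          f'epsilon * p(x) = {epsilon * density(x0):.4f}')
```

The two columns should agree closely, and the agreement improves as `epsilon` shrinks and the number of samples grows.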


Probability Density Functions 


Let us now investigate this further. We have already seen what a probability density function is 
intuitively for a random variable X, namely the density function is a function p(x) so that 


$$P(X \text{ is in an } \epsilon\text{-sized interval around } x) \approx \epsilon \cdot p(x). \tag{18.6.4}$$

But what does this imply for the properties of $p(x)$?

First, probabilities are never negative, thus we should expect that $p(x) \ge 0$ as well.


Second, let us imagine that we slice up $\mathbb{R}$ into an infinite number of slices which are $\epsilon$ wide, say
with slices $(\epsilon\cdot i, \epsilon \cdot (i+1)]$. For each of these, we know from (18.6.4) the probability is approximately

$$P(X \text{ is in an } \epsilon\text{-sized interval around } x) \approx \epsilon \cdot p(\epsilon \cdot i), \tag{18.6.5}$$

so summed over all of them it should be

$$P(X\in\mathbb{R}) \approx \sum_i \epsilon \cdot p(\epsilon\cdot i). \tag{18.6.6}$$

This is nothing more than the approximation of an integral discussed in Section 18.5, thus we can
say that

$$P(X\in\mathbb{R}) = \int_{-\infty}^{\infty} p(x) \,dx. \tag{18.6.7}$$
Since the random variable must take on some number, we know that $P(X\in\mathbb{R}) = 1$, so we can
conclude that for any density

$$\int_{-\infty}^{\infty} p(x) \,dx = 1. \tag{18.6.8}$$

Indeed, digging into this further shows that for any $a$ and $b$, we see that

$$P(X\in(a, b]) = \int_{a}^{b} p(x) \,dx. \tag{18.6.9}$$


We may approximate this in code by using the same discrete approximation methods as before. 
In this case we can approximate the probability of falling in the blue region. 


# Approximate probability using numerical integration
epsilon = 0.01
x = np.arange(-5, 5, 0.01)
p = 0.2*np.exp(-(x - 3)**2 / 2) / np.sqrt(2 * np.pi) + \
    0.8*np.exp(-(x + 1)**2 / 2) / np.sqrt(2 * np.pi)

d2l.set_figsize()
d2l.plt.plot(x, p, color='black')
d2l.plt.fill_between(x.tolist()[300:800], p.tolist()[300:800])
d2l.plt.show()

f'approximate Probability: {np.sum(epsilon*p[300:800])}'


[Figure: the density from before, with the region between x = -2 and x = 3 shaded.]

'approximate Probability: 0.7736172080039978'


It turns out that these two properties describe exactly the space of possible probability density
functions (or p.d.f.'s for the commonly encountered abbreviation). They are non-negative
functions $p(x) \ge 0$ such that

$$\int_{-\infty}^{\infty} p(x) \,dx = 1. \tag{18.6.10}$$

We interpret this function by using integration to obtain the probability our random variable is in
a specific interval:

$$P(X\in(a, b]) = \int_{a}^{b} p(x) \,dx. \tag{18.6.11}$$
In Section 18.8 we will see a number of common distributions, but let us continue working in the
abstract. 


Cumulative Distribution Functions 


In the previous section, we saw the notion of the p.d.f. In practice, this is a commonly encountered 
method to discuss continuous random variables, but it has one significant pitfall: that the values 
of the p.d.f. are not themselves probabilities, but rather a function that we must integrate to yield 
probabilities. There is nothing wrong with a density being larger than 10, as long as it is not larger 
than 10 for more than an interval of length 1/10. This can be counter-intuitive, so people often 
also think in terms of the cumulative distribution function, or c.d.f., which is a probability. 


In particular, by using (18.6.11), we define the c.d.f. for a random variable $X$ with density $p(x)$ by

$$F(x) = \int_{-\infty}^{x} p(x) \,dx = P(X \le x). \tag{18.6.12}$$

Let us observe a few properties.
* $F(x) \rightarrow 0$ as $x\rightarrow -\infty$.
* $F(x) \rightarrow 1$ as $x\rightarrow \infty$.
* $F(x)$ is non-decreasing ($y > x \implies F(y) \ge F(x)$).
* $F(x)$ is continuous (has no jumps) if $X$ is a continuous random variable.


With the fourth bullet point, note that this would not be true if X were discrete, say taking the 
values 0 and 1 both with probability 1/2. In that case 


$$F(x) = \begin{cases} 0 & x < 0, \\ \frac{1}{2} & 0 \le x < 1, \\ 1 & x \ge 1. \end{cases} \tag{18.6.13}$$


In this example, we see one of the benefits of working with the c.d.f., the ability to deal with con- 
tinuous or discrete random variables in the same framework, or indeed mixtures of the two (flip 
a coin: if heads return the roll of a die, if tails return the distance of a dart throw from the center 
of a dart board). 
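The c.d.f. can be approximated numerically by accumulating the same $\epsilon \cdot p(x)$ terms used earlier. The sketch below does this for the two-bump density from the plots and checks the properties listed above; it is written in plain NumPy (an assumption for portability), and the grid limits are illustrative.

```python
import numpy as np

epsilon = 0.01
x = np.arange(-8, 10, epsilon)
p = (0.2 * np.exp(-(x - 3)**2 / 2) +
     0.8 * np.exp(-(x + 1)**2 / 2)) / np.sqrt(2 * np.pi)

# F(x) is approximated by the running sum of epsilon * p up to x
F = np.cumsum(p) * epsilon

print(F[0], F[-1])               # close to 0 on the far left, close to 1 on the far right
print(np.all(np.diff(F) >= 0))   # non-decreasing, since p is non-negative

# P(X in (a, b]) = F(b) - F(a), here for a = -2 and b = 3
a_idx = np.searchsorted(x, -2.0)
b_idx = np.searchsorted(x, 3.0)
print(F[b_idx] - F[a_idx])
```

The last value should land near the probability computed for the shaded region earlier, since it is the same interval.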


Means 


Suppose that we are dealing with a random variable $X$. The distribution itself can be hard to
interpret. It is often useful to be able to summarize the behavior of a random variable concisely. 
Numbers that help us capture the behavior of a random variable are called summary statistics. The 
most commonly encountered ones are the mean, the variance, and the standard deviation. 


The mean encodes the average value of a random variable. If we have a discrete random variable
$X$, which takes the values $x_i$ with probabilities $p_i$, then the mean is given by the weighted average:
sum the values times the probability that the random variable takes on that value:

$$\mu_X = E[X] = \sum_i x_i p_i. \tag{18.6.14}$$


The way we should interpret the mean (albeit with caution) is that it tells us essentially where the 
random variable tends to be located. 
As a minimalistic example that we will examine throughout this section, let us take $X$ to be the
random variable which takes the value $a-2$ with probability $p$, $a+2$ with probability $p$ and $a$ with
probability $1-2p$. We can compute using (18.6.14) that, for any possible choice of $a$ and $p$, the
mean is

$$\mu_X = E[X] = \sum_i x_i p_i = (a-2)p + a(1-2p) + (a+2)p = a. \tag{18.6.15}$$


Thus we see that the mean is a. This matches the intuition since a is the location around which 
we centered our random variable. 


Because they are helpful, let us summarize a few properties. 
* For any random variable $X$ and numbers $a$ and $b$, we have that $\mu_{aX+b} = a\mu_X + b$.
* If we have two random variables $X$ and $Y$, we have $\mu_{X+Y} = \mu_X+\mu_Y$.
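The running example and the first property can be checked directly. The sketch below is a minimal plain-Python illustration; the helper `mean`, the values chosen for `a` and `p`, and the names `c` and `d` (standing in for the scaling and shift in the property above) are hypothetical choices made for this example.

```python
def mean(values, probs):
    # Weighted average: sum of value times probability, as in (18.6.14)
    return sum(v * q for v, q in zip(values, probs))

a, p = 1.5, 0.2                  # illustrative choices
values = [a - 2, a, a + 2]
probs = [p, 1 - 2 * p, p]

mu = mean(values, probs)
print(mu, a)                     # the mean equals a up to rounding

# Linearity: the mean of c * X + d is c * mu + d
c, d = 3.0, -1.0
mu_affine = mean([c * v + d for v in values], probs)
print(mu_affine, c * mu + d)
```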


Means are useful for understanding the average behavior of a random variable, however the mean
is not sufficient to even have a full intuitive understanding. Making a profit of $\$10 \pm \$1$ per sale is
very different from making $\$10 \pm \$15$ per sale despite having the same average value. The second
one has a much larger degree of fluctuation, and thus represents a much larger risk. Thus, to
understand the behavior of a random variable, we will need at minimum one more measure: some
measure of how widely a random variable fluctuates.








Variances 


This leads us to consider the variance of a random variable. This is a quantitative measure of
how far a random variable deviates from the mean. Consider the expression $X - \mu_X$. This is the
deviation of the random variable from its mean. This value can be positive or negative, so we need
to do something to make it positive so that we are measuring the magnitude of the deviation.

A reasonable thing to try is to look at $\left|X-\mu_X\right|$, and indeed this leads to a useful quantity called the
mean absolute deviation, however due to connections with other areas of mathematics and
statistics, people often use a different solution.

In particular, they look at $(X-\mu_X)^2$. If we look at the typical size of this quantity by taking the
mean, we arrive at the variance

$$\sigma_X^2 = \mathrm{Var}(X) = E\left[(X-\mu_X)^2\right] = E[X^2] - \mu_X^2. \tag{18.6.16}$$

The last equality in (18.6.16) holds by expanding out the definition in the middle, and applying the
properties of expectation.


Let us look at our example where $X$ is the random variable which takes the value $a-2$ with
probability $p$, $a+2$ with probability $p$ and $a$ with probability $1-2p$. In this case $\mu_X = a$, so all we need
to compute is $E\left[X^2\right]$. This can readily be done:

$$E\left[X^2\right] = (a-2)^2p + a^2(1-2p) + (a+2)^2p = a^2 + 8p. \tag{18.6.17}$$

Thus, we see that by (18.6.16) our variance is

$$\sigma_X^2 = \mathrm{Var}(X) = E[X^2] - \mu_X^2 = a^2 + 8p - a^2 = 8p. \tag{18.6.18}$$


This result again makes sense. The largest $p$ can be is $1/2$ which corresponds to picking $a-2$ or
$a+2$ with a coin flip. The variance of this being $4$ corresponds to the fact that both $a-2$ and
$a+2$ are $2$ units away from the mean, and $2^2 = 4$. On the other end of the spectrum, if $p=0$, this
random variable always takes the value $a$ and so it has no variance at all.


We will list a few properties of variance below:
* For any random variable $X$, $\mathrm{Var}(X) \ge 0$, with $\mathrm{Var}(X) = 0$ if and only if $X$ is a constant.
* For any random variable $X$ and numbers $a$ and $b$, we have that $\mathrm{Var}(aX+b) = a^2\mathrm{Var}(X)$.
* If we have two independent random variables $X$ and $Y$, we have $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.
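The value $8p$ and the scaling property can be confirmed on the running example. This is a minimal plain-Python sketch; the helpers `mean` and `variance`, the values of `a` and `p`, and the names `c` and `d` (playing the role of the scaling and shift in the second property) are hypothetical choices for illustration.

```python
def mean(values, probs):
    return sum(v * q for v, q in zip(values, probs))

def variance(values, probs):
    # E[(X - mu)^2], the middle expression in (18.6.16)
    mu = mean(values, probs)
    return sum((v - mu)**2 * q for v, q in zip(values, probs))

a, p = 1.5, 0.2                  # illustrative choices
values = [a - 2, a, a + 2]
probs = [p, 1 - 2 * p, p]

var = variance(values, probs)
# The alternative form E[X^2] - mu^2 from the right-hand side of (18.6.16)
second_moment_form = mean([v**2 for v in values], probs) - mean(values, probs)**2
print(var, second_moment_form, 8 * p)

# Scaling by c multiplies the variance by c^2; the shift d has no effect
c, d = 3.0, -1.0
var_affine = variance([c * v + d for v in values], probs)
print(var_affine, c**2 * var)
```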


When interpreting these values, there can be a bit of a hiccup. In particular, let us try imagining
what happens if we keep track of units through this computation. Suppose that we are working
with the star rating assigned to a product on the web page. Then $a$, $a-2$, and $a+2$ are all measured
in units of stars. Similarly, the mean $\mu_X$ is then also measured in stars (being a weighted average).
However, if we get to the variance, we immediately encounter an issue, which is we want to look at
$(X-\mu_X)^2$, which is in units of squared stars. This means that the variance itself is not comparable
to the original measurements. To make it interpretable, we will need to return to our original
units.


Standard Deviations 


This summary statistic can always be deduced from the variance by taking the square root! Thus
we define the standard deviation to be

$$\sigma_X = \sqrt{\mathrm{Var}(X)}. \tag{18.6.19}$$

In our example, this means the standard deviation is $\sigma_X = 2\sqrt{2p}$. If we are dealing
with units of stars for our review example, $\sigma_X$ is again in units of stars.


The properties we had for the variance can be restated for the standard deviation.
* For any random variable $X$, $\sigma_{X} \ge 0$.
* For any random variable $X$ and numbers $a$ and $b$, we have that $\sigma_{aX+b} = |a|\sigma_{X}$.
* If we have two independent random variables $X$ and $Y$, we have $\sigma_{X+Y} = \sqrt{\sigma_{X}^2 + \sigma_{Y}^2}$.


It is natural at this moment to ask, "If the standard deviation is in the units of our original random
variable, does it represent something we can draw with regards to that random variable?" The
answer is a resounding yes! Indeed much like the mean told us the typical location of our random
variable, the standard deviation gives the typical range of variation of that random variable. We
can make this rigorous with what is known as Chebyshev's inequality:

$$P\left(X \not\in [\mu_X - \alpha\sigma_X, \mu_X + \alpha\sigma_X]\right) \le \frac{1}{\alpha^2}. \tag{18.6.20}$$

Or to state it verbally in the case of $\alpha=10$, at least $99\%$ of the samples from any random variable fall
within $10$ standard deviations of the mean. This gives an immediate interpretation to our standard
summary statistics.


To see how this statement is rather subtle, let us take a look at our running example again where
$X$ is the random variable which takes the value $a-2$ with probability $p$, $a+2$ with probability $p$
and $a$ with probability $1-2p$. We saw that the mean was $a$ and the standard deviation was $2\sqrt{2p}$.
This means, if we take Chebyshev's inequality (18.6.20) with $\alpha = 2$, we see that the expression is

$$P\left(X \not\in [a - 4\sqrt{2p}, a + 4\sqrt{2p}]\right) \le \frac{1}{4}. \tag{18.6.21}$$
This means that at least $75\%$ of the time, this random variable will fall within this interval for any value of
$p$. Now, notice that as $p \rightarrow 0$, this interval also converges to the single point $a$. But we know that
our random variable takes the values $a-2$, $a$, and $a+2$ only so eventually we can be certain $a-2$
and $a+2$ will fall outside the interval! The question is, at what $p$ does that happen? So we want to
solve: for what $p$ does $a+4\sqrt{2p} = a+2$, which is solved when $p=1/8$, which is exactly the first $p$
where it could possibly happen without violating our claim that no more than $1/4$ of samples from
the distribution would fall outside the interval ($1/8$ to the left, and $1/8$ to the right).
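The same reasoning can be checked by computing the exact probability mass outside the interval for a few values of $p$. The helper `prob_outside` below is a hypothetical name introduced for this sketch, which uses only the Python standard library.

```python
import math

def prob_outside(a, p, alpha=2.0):
    """Exact P(X outside [a - alpha*sigma, a + alpha*sigma]) for the three-point X."""
    sigma = 2 * math.sqrt(2 * p)          # standard deviation of the running example
    lo, hi = a - alpha * sigma, a + alpha * sigma
    values = [a - 2, a, a + 2]
    probs = [p, 1 - 2 * p, p]
    # Add up the probabilities of the values that fall outside the interval
    return sum(q for v, q in zip(values, probs) if not (lo <= v <= hi))

bound = 1 / 2.0**2                        # Chebyshev bound for alpha = 2
for p in [0.2, 0.125, 0.12, 0.05]:
    print(f'p = {p}: mass outside = {prob_outside(0.0, p)}, bound = {bound}')
```

For $p \ge 1/8$ nothing falls outside the interval, and once $p < 1/8$ the mass outside is $2p$, which stays below the bound of $1/4$.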


Let us visualize this. We will show the probability of getting the three values as three vertical bars 
with height proportional to the probability. The interval will be drawn as a horizontal line in the 
middle. The first plot shows what happens for p > 1/8 where the interval safely contains all points. 


# Define a helper to plot these figures
def plot_chebyshev(a, p):
    d2l.set_figsize()
    # Newer Matplotlib releases no longer accept `use_line_collection`;
    # drop the argument if it raises a TypeError
    d2l.plt.stem([a-2, a, a+2], [p, 1-2*p, p], use_line_collection=True)
    d2l.plt.xlim([-4, 4])
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('p.m.f.')

    d2l.plt.hlines(0.5, a - 4 * np.sqrt(2 * p),
                   a + 4 * np.sqrt(2 * p), 'black', lw=4)
    d2l.plt.vlines(a - 4 * np.sqrt(2 * p), 0.53, 0.47, 'black', lw=1)
    d2l.plt.vlines(a + 4 * np.sqrt(2 * p), 0.53, 0.47, 'black', lw=1)
    d2l.plt.title(f'p = {p:.3f}')

    d2l.plt.show()

# Plot interval when p > 1/8
plot_chebyshev(0.0, 0.2)


[Figure: stem plot titled "p = 0.200" with the interval drawn as a horizontal bar; x-axis "x", y-axis "p.m.f."]


The second shows that at p = 1/8, the interval exactly touches the two points. This shows that the 
inequality is sharp, since no smaller interval could be taken while keeping the inequality true. 


# Plot interval when p = 1/8 
plot_chebyshev(0.0, 0.125) 
[Figure: stem plot titled "p = 0.125" with the interval drawn as a horizontal bar; x-axis "x", y-axis "p.m.f."]


The third shows that for p < 1/8 the interval only contains the center. This does not invalidate the 
inequality since we only needed to ensure that no more than 1/4 of the probability falls outside 
the interval, which means that once p < 1/8, the two points at a — 2 and a + 2 can be discarded. 


# Plot interval when p < 1/8 
plot_chebyshev(0.0, 0.05) 


[Figure: stem plot titled "p = 0.050" with the interval drawn as a horizontal bar; x-axis "x", y-axis "p.m.f."]


Means and Variances in the Continuum 


This has all been in terms of discrete random variables, but the case of continuous random
variables is similar. To intuitively understand how this works, imagine that we split the real number
line into intervals of length $\epsilon$ given by $(\epsilon i, \epsilon (i+1)]$. Once we do this, our continuous random
variable has been made discrete and we can use (18.6.14) to say that

$$\begin{aligned} \mu_X & \approx \sum_{i} (\epsilon i)P(X \in (\epsilon i, \epsilon (i+1)]) \\ & \approx \sum_{i} (\epsilon i)p_X(\epsilon i)\epsilon, \end{aligned} \tag{18.6.22}$$
where $p_X$ is the density of $X$. This is an approximation to the integral of $xp_X(x)$, so we can
conclude that

$$\mu_X = \int_{-\infty}^\infty xp_X(x) \,dx. \tag{18.6.23}$$

Similarly, using (18.6.16) the variance can be written as

$$\sigma^2_X = E[X^2] - \mu_X^2 = \int_{-\infty}^\infty x^2p_X(x) \,dx - \left(\int_{-\infty}^\infty xp_X(x) \,dx\right)^2. \tag{18.6.24}$$


Everything stated above about the mean, the variance, and the standard deviation still applies in
this case. For instance, if we consider the random variable with density

$$p(x) = \begin{cases} 1 & x \in [0,1], \\ 0 & \text{otherwise}, \end{cases} \tag{18.6.25}$$

we can compute

$$\mu_X = \int_{-\infty}^\infty xp(x) \,dx = \int_0^1 x \,dx = \frac{1}{2}, \tag{18.6.26}$$

and

$$\sigma_X^2 = \int_{-\infty}^\infty x^2p(x) \,dx - \left(\frac{1}{2}\right)^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}. \tag{18.6.27}$$
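The same values can be recovered from the discretized sums in (18.6.22). The sketch below is a minimal plain-NumPy illustration (an assumption for portability); it evaluates the density at cell midpoints rather than at the left endpoints used in the text, which is an illustrative choice that reduces the discretization error.

```python
import numpy as np

epsilon = 1e-4
# Midpoints of the 10,000 cells of width epsilon that cover [0, 1]
x = (np.arange(0, 10000) + 0.5) * epsilon
p = np.ones_like(x)              # the density equals 1 on [0, 1]

mean = np.sum(x * p) * epsilon
var = np.sum(x**2 * p) * epsilon - mean**2
print(mean, var, 1 / 12)
```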


As a warning, let us examine one more example, known as the Cauchy distribution. This is the
distribution with p.d.f. given, up to a normalizing constant, by

$$p(x) = \frac{1}{1+x^2}. \tag{18.6.28}$$

# Plot the Cauchy distribution p.d.f.
x = np.arange(-5, 5, 0.01)
p = 1 / (1 + x**2)

d2l.plot(x, p, 'x', 'p.d.f.')

[Figure: the function 1 / (1 + x^2) plotted against x on [-5, 5]; x-axis "x".]

This function looks innocent, and indeed consulting a table of integrals will show that the area
under it is finite: it equals $\pi$, so $p(x)/\pi$ has area one under it and thus defines a continuous
random variable. The constant factor does not affect any of the conclusions below, so we leave it out.


To see what goes astray, let us try to compute the variance of this. This would involve using
(18.6.16) to compute

$$\int_{-\infty}^\infty \frac{x^2}{1+x^2} \,dx. \tag{18.6.29}$$

The function on the inside looks like this: 

# Plot the integrand needed to compute the variance
x = np.arange(-20, 20, 0.01)
p = x**2 / (1 + x**2)

d2l.plot(x, p, 'x', 'integrand')


[Figure: the integrand x^2 / (1 + x^2) plotted against x on [-20, 20]; x-axis "x", y-axis "integrand".]


This function clearly has infinite area under it since it is essentially the constant one with a small 
dip near zero, and indeed we could show that 


$$\int_{-\infty}^\infty \frac{x^2}{1+x^2} \,dx = \infty. \tag{18.6.30}$$

This means it does not have a well-defined finite variance. 


However, looking deeper shows an even more disturbing result. Let us try to compute the mean
using (18.6.23). Using the change of variables formula, we see

$$\int_{-\infty}^{\infty} \frac{x}{1+x^2} \,dx = \frac{1}{2}\int_1^\infty \frac{1}{u} \,du. \tag{18.6.31}$$

The integral inside is the definition of the logarithm, so this is in essence $\log(\infty) = \infty$, so there is
no well-defined average value either!


Machine learning scientists define their models so that we most often do not need to deal with 
these issues, and will in the vast majority of cases deal with random variables with well-defined 
means and variances. However, every so often random variables with heavy tails (that is those 
random variables where the probabilities of getting large values are large enough to make things 
like the mean or variance undefined) are helpful in modeling physical systems, thus it is worth 
knowing that they exist. 
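One practical symptom of an undefined mean is that sample averages never settle down. The sketch below compares running averages of Cauchy and Gaussian samples. It is a minimal plain-NumPy illustration (an assumption for portability); `standard_cauchy` draws from the normalized density $1/(\pi(1+x^2))$, and the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
cauchy = rng.standard_cauchy(n)
normal = rng.standard_normal(n)

# Running averages after 1, 2, ..., n samples
steps = np.arange(1, n + 1)
cauchy_running = np.cumsum(cauchy) / steps
normal_running = np.cumsum(normal) / steps

for k in [100, 1_000, 10_000, 100_000]:
    print(k, cauchy_running[k - 1], normal_running[k - 1])

# A single extreme draw can dominate the whole Cauchy average
print(np.max(np.abs(cauchy)), np.max(np.abs(normal)))
```

The Gaussian running average shrinks towards zero as more samples arrive, while the Cauchy one keeps jumping whenever an extreme value is drawn.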
Joint Density Functions


The above work all assumes we are working with a single real valued random variable. But what
if we are dealing with two or more potentially highly correlated random variables? This
circumstance is the norm in machine learning: imagine random variables like $R_{i, j}$ which encode the red
value of the pixel at the $(i, j)$ coordinate in an image, or $P_t$ which is a random variable given by
a stock price at time $t$. Nearby pixels tend to have similar color, and nearby times tend to have
similar prices. We cannot treat them as separate random variables, and expect to create a
successful model (we will see in Section 18.9 a model that under-performs due to such an assumption).
We need to develop the mathematical language to handle these correlated continuous random
variables.


Thankfully, with the multiple integrals in Section 18.5 we can develop such a language. Suppose 
that we have, for simplicity, two random variables X, Y which can be correlated. Then, similar to 
the case of a single variable, we can ask the question: 


$$P(X \text{ is in an } \epsilon\text{-sized interval around } x \text{ and } Y \text{ is in an } \epsilon\text{-sized interval around } y). \tag{18.6.32}$$

Similar reasoning to the single variable case shows that this should be approximately

$$P(X \text{ is in an } \epsilon\text{-sized interval around } x \text{ and } Y \text{ is in an } \epsilon\text{-sized interval around } y) \approx \epsilon^{2}p(x, y), \tag{18.6.33}$$

for some function $p(x, y)$. This is referred to as the joint density of $X$ and $Y$. Similar properties
are true for this as we saw in the single variable case. Namely:

* $p(x, y) \ge 0$;
* $\int_{\mathbb{R}^2} p(x, y) \,dx\,dy = 1$;
* $P((X, Y) \in \mathcal{D}) = \int_{\mathcal{D}} p(x, y) \,dx\,dy$.


In this way, we can deal with multiple, potentially correlated random variables. If we wish to
work with more than two random variables, we can extend the multivariate density to as many
coordinates as desired by considering $p(\mathbf{x}) = p(x_1, \ldots, x_n)$. The same properties of being
non-negative, and having total integral of one still hold.


Marginal Distributions 
When dealing with multiple variables, we oftentimes want to be able to ignore the relationships 
and ask, “how is this one variable distributed?” Such a distribution is called a marginal distribution. 


To be concrete, let us suppose that we have two random variables $X, Y$ with joint density given by
$p_{X, Y}(x, y)$. We will be using the subscript to indicate what random variables the density is for. The
question of finding the marginal distribution is taking this function, and using it to find $p_X(x)$.


As with most things, it is best to return to the intuitive picture to figure out what should be true.
Recall that the density is the function $p_X$ so that

$$P(X \in [x, x+\epsilon]) \approx \epsilon \cdot p_X(x). \tag{18.6.34}$$

There is no mention of $Y$, but if all we are given is $p_{X, Y}$, we need to include $Y$ somehow. We can
first observe that this is the same as

$$P(X \in [x, x+\epsilon] \text{, and } Y \in \mathbb{R}) \approx \epsilon \cdot p_X(x). \tag{18.6.35}$$
Our density does not directly tell us about what happens in this case, we need to split into small
intervals in $y$ as well, so we can write this as

$$\begin{aligned} \epsilon \cdot p_X(x) & \approx \sum_{i} P(X \in [x, x+\epsilon] \text{, and } Y \in [\epsilon \cdot i, \epsilon \cdot (i+1)]) \\ & \approx \sum_{i} \epsilon^{2} p_{X, Y}(x, \epsilon\cdot i). \end{aligned} \tag{18.6.36}$$





[Figure: two panels, "Joint Probability" and "Marginal Probability".]


Fig. 18.6.1: By summing along the columns of our array of probabilities, we are able to obtain the 
marginal distribution for just the random variable represented along the x-axis. 


This tells us to add up the value of the density along a series of squares in a line as is shown in Fig.
18.6.1. Indeed, after canceling one factor of epsilon from both sides, and recognizing the sum on
the right is the integral over $y$, we can conclude that

$$\begin{aligned} p_X(x) & \approx \sum_{i} \epsilon p_{X, Y}(x, \epsilon\cdot i) \\ & \approx \int_{-\infty}^\infty p_{X, Y}(x, y) \,dy. \end{aligned} \tag{18.6.37}$$

Thus we see

$$p_X(x) = \int_{-\infty}^\infty p_{X, Y}(x, y) \,dy. \tag{18.6.38}$$


This tells us that to get a marginal distribution, we integrate over the variables we do not care
about. This process is often referred to as integrating out or marginalizing out the unneeded
variables.
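On a grid, integrating out a variable amounts to summing the joint density along one axis and multiplying by the cell width, exactly as in (18.6.37). The sketch below illustrates this in plain NumPy (an assumption for portability). The joint density used, a bivariate Gaussian with correlation 0.6, is a hypothetical example chosen because its marginal is known to be a standard Gaussian.

```python
import numpy as np

epsilon = 0.05
x = np.arange(-6, 6, epsilon)
y = np.arange(-6, 6, epsilon)
X, Y = np.meshgrid(x, y, indexing='ij')   # X[i, j] = x[i], Y[i, j] = y[j]

# Hypothetical joint density: a bivariate Gaussian with correlation 0.6
rho = 0.6
joint = np.exp(-(X**2 - 2 * rho * X * Y + Y**2) / (2 * (1 - rho**2)))
joint /= 2 * np.pi * np.sqrt(1 - rho**2)

# Integrate out y: sum along the y axis and multiply by the cell width
marginal_x = joint.sum(axis=1) * epsilon

# For this particular joint density the marginal is a standard Gaussian
expected = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(marginal_x - expected)))

# Both the joint and the marginal should have total mass close to one
print(np.sum(joint) * epsilon**2, np.sum(marginal_x) * epsilon)
```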


Covariance 


When dealing with multiple random variables, there is one additional summary statistic which
is helpful to know: the covariance. This measures the degree that two random variables fluctuate
together.

Suppose that we have two random variables $X$ and $Y$. To begin with, let us suppose they are
discrete, taking on values $(x_i, y_j)$ with probability $p_{ij}$. In this case, the covariance is defined as

$$\sigma_{XY} = \mathrm{Cov}(X, Y) = \sum_{i, j} (x_i - \mu_X) (y_j-\mu_Y) p_{ij} = E[XY] - E[X]E[Y]. \tag{18.6.39}$$
To think about this intuitively: consider the following pair of random variables. Suppose that $X$
takes the values $1$ and $3$, and $Y$ takes the values $-1$ and $3$. Suppose that we have the following
probabilities

$$\begin{aligned} P(X = 1 \text{ and } Y = -1) & = \frac{p}{2}, \\ P(X = 1 \text{ and } Y = 3) & = \frac{1-p}{2}, \\ P(X = 3 \text{ and } Y = -1) & = \frac{1-p}{2}, \\ P(X = 3 \text{ and } Y = 3) & = \frac{p}{2}, \end{aligned} \tag{18.6.40}$$

where $p$ is a parameter in $[0,1]$ we get to pick. Notice that if $p=1$ then they are both always
their minimum or maximum values simultaneously, and if $p=0$ they are guaranteed to take their
flipped values simultaneously (one is large when the other is small and vice versa). If $p=1/2$,
then the four possibilities are all equally likely, and neither should be related. Let us compute the
covariance. First, note $\mu_X = 2$ and $\mu_Y = 1$, so we may compute using (18.6.39):

$$\begin{aligned} \mathrm{Cov}(X, Y) & = \sum_{i, j} (x_i - \mu_X) (y_j-\mu_Y) p_{ij} \\ & = (1-2)(-1-1)\frac{p}{2} + (1-2)(3-1)\frac{1-p}{2} + (3-2)(-1-1)\frac{1-p}{2} + (3-2)(3-1)\frac{p}{2} \\ & = 4p-2. \end{aligned} \tag{18.6.41}$$


When $p=1$ (the case where they are both maximally positive or negative at the same time), the
covariance is $2$. When $p=0$ (the case where they are flipped), the covariance is $-2$. Finally, when
$p=1/2$ (the case where they are unrelated), the covariance is $0$. Thus we see that the covariance
measures how these two random variables are related.


A quick note on the covariance is that it only measures these linear relationships. More complex
relationships like $X = Y^2$ where $Y$ is randomly chosen from $\{-2, -1, 0, 1, 2\}$ with equal
probability can be missed. Indeed a quick computation shows that these random variables have covariance
zero, despite one being a deterministic function of the other.
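Both computations are short enough to verify directly from the definition (18.6.39). The helpers `covariance` and `example` below are hypothetical names introduced for this plain-Python sketch.

```python
def covariance(pairs):
    """Covariance from a list of ((x, y), probability) entries, as in (18.6.39)."""
    mu_x = sum(x * q for (x, y), q in pairs)
    mu_y = sum(y * q for (x, y), q in pairs)
    return sum((x - mu_x) * (y - mu_y) * q for (x, y), q in pairs)

def example(p):
    # The four-outcome distribution from (18.6.40)
    return [((1, -1), p / 2), ((1, 3), (1 - p) / 2),
            ((3, -1), (1 - p) / 2), ((3, 3), p / 2)]

for p in [0.0, 0.25, 0.5, 1.0]:
    print(p, covariance(example(p)), 4 * p - 2)

# X = Y^2 with Y uniform on {-2, -1, 0, 1, 2}: fully dependent, yet uncorrelated
squares = [((y**2, y), 1 / 5) for y in [-2, -1, 0, 1, 2]]
print(covariance(squares))
```

The first loop reproduces $4p - 2$, and the last line prints a value that is zero up to rounding even though $X$ is a deterministic function of $Y$.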


For continuous random variables, much the same story holds. At this point, we are pretty
comfortable with doing the transition between discrete and continuous, so we will provide the continuous
analogue of (18.6.39) without any derivation.

$$\sigma_{XY} = \int_{\mathbb{R}^2} (x-\mu_X)(y-\mu_Y)p(x, y) \,dx\,dy. \tag{18.6.42}$$

For visualization, let us take a look at a collection of random variables with tunable covariance. 


# Plot a few random variables with adjustable covariance
covs = [-0.9, 0.0, 1.2]
d2l.plt.figure(figsize=(12, 3))
for i in range(3):
    X = np.random.normal(0, 1, 500)
    Y = covs[i]*X + np.random.normal(0, 1, (500))

    d2l.plt.subplot(1, 4, i+1)
    d2l.plt.scatter(X.asnumpy(), Y.asnumpy())
    d2l.plt.xlabel('X')
    d2l.plt.ylabel('Y')
    d2l.plt.title(f'cov = {covs[i]}')
d2l.plt.show()








Let us see some properties of covariances:
* For any random variable $X$, $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$.
* For any random variables $X, Y$ and numbers $a$ and $b$, $\mathrm{Cov}(aX+b, Y) = \mathrm{Cov}(X, aY+b) = a\mathrm{Cov}(X, Y)$.
* If $X$ and $Y$ are independent then $\mathrm{Cov}(X, Y) = 0$.

In addition, we can use the covariance to expand a relationship we saw before. Recall that if $X$
and $Y$ are two independent random variables then


Var(X + Y) = Var(X) + Var(Y). (18.6.43) 


With knowledge of covariances, we can expand this relationship. Indeed, some algebra can show 
that in general, 


Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y). (18.6.44) 


This allows us to generalize the variance summation rule for correlated random variables. 
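The identity (18.6.44) also holds exactly for sample moments, which makes it easy to check. The sketch below uses plain NumPy (an assumption for portability); the coefficient 0.7, the seed, and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(10_000)
Y = 0.7 * X + rng.standard_normal(10_000)   # correlated with X by construction

# Sample versions (dividing by n) of the quantities in (18.6.44)
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))
lhs = np.var(X + Y)
rhs = np.var(X) + np.var(Y) + 2 * cov_xy
print(lhs, rhs)                    # agree up to floating-point error

# The rule for independent variables misses the cross term
print(np.var(X) + np.var(Y))
```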


Correlation 


As we did in the case of means and variances, let us now consider units. If X is measured in one 
unit (say inches), and Y is measured in another (say dollars), the covariance is measured in the 
product of these two units, inches × dollars. These units can be hard to interpret. What we will
often want in this case is a unit-less measurement of relatedness. Indeed, often we do not care 
about exact quantitative correlation, but rather ask if the correlation is in the same direction, and 
how strong the relationship is. 


To see what makes sense, let us perform a thought experiment. Suppose that we convert our
random variables in inches and dollars to be in inches and cents. In this case the random variable $Y$ is
multiplied by $100$. If we work through the definition, this means that $\mathrm{Cov}(X, Y)$ will be multiplied
by $100$. Thus we see that in this case a change of units changes the covariance by a factor of $100$.
Thus, to find our unit-invariant measure of correlation, we will need to divide by something else
that also gets scaled by $100$. Indeed we have a clear candidate, the standard deviation! Indeed if
we define the correlation coefficient to be

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_{X}\sigma_{Y}}, \tag{18.6.45}$$

we see that this is a unit-less value. A little mathematics can show that this number is between $-1$
and $1$ with $1$ meaning maximally positively correlated, whereas $-1$ means maximally negatively
correlated.


Returning to our explicit discrete example above, we can see that $\sigma_X = 1$ and $\sigma_Y = 2$, so we can
compute the correlation between the two random variables using (18.6.45) to see that

$$\rho(X, Y) = \frac{4p-2}{1\cdot 2} = 2p-1. \tag{18.6.46}$$

This now ranges between $-1$ and $1$ with the expected behavior of $1$ meaning most positively correlated, and
$-1$ meaning most negatively correlated.


As another example, consider $X$ as any random variable, and $Y=aX+b$ as any linear
deterministic function of $X$ with $a \neq 0$. Then, one can compute that

$$\sigma_{Y} = \sigma_{aX+b} = |a|\sigma_{X}, \tag{18.6.47}$$

$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(X, aX+b) = a\mathrm{Cov}(X, X) = a\mathrm{Var}(X), \tag{18.6.48}$$

and thus by (18.6.45) that

$$\rho(X, Y) = \frac{a\mathrm{Var}(X)}{|a|\sigma_{X}^2} = \frac{a}{|a|} = \mathrm{sign}(a). \tag{18.6.49}$$

Thus we see that the correlation is $+1$ for any $a > 0$, and $-1$ for any $a < 0$ illustrating that
correlation measures the degree and directionality the two random variables are related, not the scale
that the variation takes.
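The result (18.6.49) and the unit-invariance discussed above can both be observed on samples. The helper `correlation` below is a hypothetical name for a sample version of (18.6.45); the sketch uses plain NumPy (an assumption for portability) with arbitrary coefficients and seed.

```python
import numpy as np

def correlation(X, Y):
    # Sample version of (18.6.45): covariance divided by both standard deviations
    cov = np.mean((X - X.mean()) * (Y - Y.mean()))
    return cov / (X.std() * Y.std())

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000)

# Y = a * X + b has correlation sign(a) with X, whatever the size of a and b
for a, b in [(2.0, 1.0), (0.01, -5.0), (-3.0, 4.0)]:
    print(a, b, correlation(X, a * X + b))

# Rescaling one variable (dollars to cents, say) leaves the correlation unchanged
Y = 0.5 * X + rng.standard_normal(1_000)
print(correlation(X, Y), correlation(X, 100 * Y))
```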


Let us again plot a collection of random variables with tunable correlation. 


# Plot a few random variables with adjustable correlations
cors = [-0.9, 0.0, 1.0]
d2l.plt.figure(figsize=(12, 3))
for i in range(3):
    X = np.random.normal(0, 1, 500)
    Y = cors[i] * X + np.sqrt(1 - cors[i]**2) * np.random.normal(0, 1, 500)

    d2l.plt.subplot(1, 4, i + 1)
    d2l.plt.scatter(X.asnumpy(), Y.asnumpy())
    d2l.plt.xlabel('X')
    d2l.plt.ylabel('Y')
    d2l.plt.title(f'cor = {cors[i]}')
d2l.plt.show()
[Figure: scatter plots of Y against X, one panel per correlation value; the last panel is titled "cor = 1.0".]
Let us list a few properties of the correlation below.
* For any random variable $X$, $\rho(X, X) = 1$.
* For any random variables $X, Y$ and numbers $a > 0$ and $b$, $\rho(aX+b, Y) = \rho(X, aY+b) = \rho(X, Y)$ (for $a < 0$ the sign of the correlation flips).
* If $X$ and $Y$ are independent with non-zero variance then $\rho(X, Y) = 0$.


As a final note, you may feel like some of these formulae are familiar. Indeed, if we expand everything
out assuming that $\mu_X = \mu_Y = 0$, we see that this is

$$\rho(X, Y) = \frac{\sum_{i, j} x_i y_j p_{ij}}{\sqrt{\sum_{i, j} x_i^2 p_{ij}}\sqrt{\sum_{i, j} y_j^2 p_{ij}}}. \tag{18.6.50}$$

This looks like a sum of a product of terms divided by the square root of sums of terms. This is
exactly the formula for the cosine of the angle between two vectors v, w with the different coordinates
weighted by $p_{ij}$:

$$\cos(\theta) = \frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|} = \frac{\sum_i v_i w_i}{\sqrt{\sum_i v_i^2}\sqrt{\sum_i w_i^2}}. \tag{18.6.51}$$

Indeed if we think of norms as being related to standard deviations, and correlations as being
cosines of angles, much of the intuition we have from geometry can be applied to thinking about
random variables.
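The geometric reading of correlation can be illustrated on samples: after centering, the sample correlation coincides with the cosine of the angle between the two data vectors. This is a small sketch with plain NumPy; the distribution parameters and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, 5_000)
y = -0.7 * x + rng.normal(0.0, 1.0, 5_000)

# Center the samples so that the zero-mean assumption of the text holds
v, w = x - x.mean(), y - y.mean()

# Cosine of the angle between the centered vectors
cosine = v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

# Sample correlation as computed by NumPy
rho = np.corrcoef(x, y)[0, 1]

print(cosine, rho)
```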


Summary 


* Continuous random variables are random variables that can take on a continuum of values. 
They have some technical difficulties that make them more challenging to work with com- 
pared to discrete random variables. 


* The probability density function allows us to work with continuous random variables by
giving a function where the area under the curve on some interval gives the probability of
finding a sample point in that interval.


* The cumulative distribution function is the probability of observing the random variable to
be less than a given threshold. It can provide a useful alternate viewpoint which unifies
discrete and continuous variables.


* The mean is the average value of a random variable.


* The variance is the expected square of the difference between the random variable and its
mean.


* The standard deviation is the square root of the variance. It can be thought of as measuring
the range of values the random variable may take.


* Chebyshev's inequality allows us to make this intuition rigorous by giving an explicit interval
that contains the random variable most of the time.


* Joint densities allow us to work with correlated random variables. We may marginalize joint
densities by integrating over unwanted random variables to get the distribution of the desired
random variable.


* The covariance and correlation coefficient provide a way to measure any linear relationship
between two correlated random variables.


Exercises 


1. Suppose that we have the random variable with density given by $p(x) = \frac{1}{x^2}$ for $x \ge 1$ and
p(x) = 0 otherwise. What is P(X > 2)?


2. The Laplace distribution is a random variable whose density is given by $p(x) = \frac{1}{2}e^{-|x|}$. What
is the mean and the standard deviation of this function? As a hint, $\int_0^\infty x e^{-x}\,dx = 1$ and
$\int_0^\infty x^2 e^{-x}\,dx = 2$.


3. I walk up to you on the street and say "I have a random variable with mean 1, standard deviation
2, and I observed 25% of my samples taking a value larger than 9." Do you believe me?
Why or why not?


4. Suppose that you have two random variables X, Y, with joint density given by $p_{XY}(x, y) = 4xy$
for $x, y \in [0, 1]$ and $p_{XY}(x, y) = 0$ otherwise. What is the covariance of X and Y?


Discussions [246]


18.7 Maximum Likelihood 


One of the most commonly encountered ways of thinking in machine learning is the maximum
likelihood point of view. This is the concept that when working with a probabilistic model with
unknown parameters, the parameters which make the data have the highest probability are the
most likely ones.





[246] https://discuss.d2l.ai/t/415


18.7.1 The Maximum Likelihood Principle


The maximum likelihood principle has a Bayesian interpretation which can be helpful to think about.
Suppose that we have a model with parameters $\theta$ and a collection of data examples X. For
concreteness, we can imagine that $\theta$ is a single value representing the probability that a coin
comes up heads when flipped, and X is a sequence of independent coin flips. We will look at this
example in depth later.


If we want to find the most likely value for the parameters of our model, that means we want to
find

$$\mathop{\mathrm{argmax}}_\theta P(\theta \mid X). \tag{18.7.1}$$

By Bayes' rule, this is the same thing as

$$\mathop{\mathrm{argmax}}_\theta \frac{P(X \mid \theta)P(\theta)}{P(X)}. \tag{18.7.2}$$

The expression P(X), a parameter-agnostic probability of generating the data, does not depend
on $\theta$ at all, and so can be dropped without changing the best choice of $\theta$. Similarly, we may now
posit that we have no prior assumption on which set of parameters are better than any others, so
we may declare that $P(\theta)$ does not depend on $\theta$ either! This, for instance, makes sense in our
coin flipping example where the probability it comes up heads could be any value in [0, 1] without
any prior belief it is fair or not (often referred to as an uninformative prior). Thus we see that our
application of Bayes' rule shows that our best choice of $\theta$ is the maximum likelihood estimate for
$\theta$:

$$\hat{\theta} = \mathop{\mathrm{argmax}}_\theta P(X \mid \theta). \tag{18.7.3}$$

As a matter of common terminology, the probability of the data given the parameters ($P(X \mid \theta)$)
is referred to as the likelihood.


A Concrete Example 


Let us see how this works in a concrete example. Suppose that we have a single parameter $\theta$
representing the probability that a coin flip is heads. Then the probability of getting a tails is $1 - \theta$,
and so if our observed data X is a sequence with $n_H$ heads and $n_T$ tails, we can use the fact that
independent probabilities multiply to see that

$$P(X \mid \theta) = \theta^{n_H}(1 - \theta)^{n_T}. \tag{18.7.4}$$

If we flip 13 coins and get the sequence "HHHTHTTHHHHHT", which has $n_H = 9$ and $n_T = 4$, we
see that this is

$$P(X \mid \theta) = \theta^9(1 - \theta)^4. \tag{18.7.5}$$

One nice thing about this example is that we know the answer going in. Indeed, if we said
verbally, "I flipped 13 coins, and 9 came up heads, what is our best guess for the probability that
the coin comes up heads?," everyone would correctly guess 9/13. What this maximum likelihood
method will give us is a way to get that number from first principles in a way that will generalize
to vastly more complex situations.


For our example, the plot of $P(X \mid \theta)$ is as follows:
%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, np, npx
npx.set_np()

theta = np.arange(0, 1, 0.001)
p = theta**9 * (1 - theta)**4.

d2l.plot(theta, p, 'theta', 'likelihood')
[Figure: the likelihood plotted against theta on the interval from 0.0 to 1.0.]

This has its maximum value somewhere near our expected 9/13 ≈ 0.7.... To see if it is exactly
there, we can turn to calculus. Notice that at the maximum, the function is flat. Thus, we could
find the maximum likelihood estimate (18.7.1) by finding the values of $\theta$ where the derivative is
zero, and finding the one that gives the highest probability. We compute:

$$\begin{aligned}
0 &= \frac{d}{d\theta} P(X \mid \theta) \\
&= \frac{d}{d\theta} \theta^9(1 - \theta)^4 \\
&= 9\theta^8(1 - \theta)^4 - 4\theta^9(1 - \theta)^3 \\
&= \theta^8(1 - \theta)^3(9 - 13\theta).
\end{aligned} \tag{18.7.6}$$

This has three solutions: 0, 1 and 9/13. The first two are clearly minima, not maxima, as they assign
probability 0 to our sequence. The final value does not assign zero probability to our sequence,
and thus must be the maximum likelihood estimate $\hat{\theta} = 9/13$.
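As a quick sanity check of this calculation, one can locate the maximum of the likelihood on a fine grid and evaluate the factored derivative from (18.7.6) at 9/13. This is only a sketch in plain NumPy; the grid resolution is an arbitrary choice.

```python
import numpy as np

n_H, n_T = 9, 4
theta = np.linspace(0.0, 1.0, 100_001)
likelihood = theta**n_H * (1 - theta)**n_T

# Grid point with the largest likelihood
theta_hat = theta[np.argmax(likelihood)]

# Factored derivative from (18.7.6), evaluated at the closed-form answer 9/13
t = 9 / 13
deriv_at_mle = t**8 * (1 - t)**3 * (9 - 13 * t)

print(theta_hat, 9 / 13, deriv_at_mle)
```

The grid maximizer agrees with 9/13 up to the grid spacing, and the derivative vanishes there up to rounding error.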


18.7.2 Numerical Optimization and the Negative Log-Likelihood 


The previous example is nice, but what if we have billions of parameters and data examples?


First notice that, if we make the assumption that all the data examples are independent, we can
no longer practically consider the likelihood itself as it is a product of many probabilities. Indeed,
each probability is in [0, 1], say typically of value about 1/2, and the product $(1/2)^{1000000000}$ is far
below machine precision. We cannot work with that directly.


However, recall that the logarithm turns products to sums, in which case (using base-10 logarithms)

$$\log_{10}\left((1/2)^{1000000000}\right) = 1000000000 \cdot \log_{10}(1/2) \approx -301029995.6\ldots \tag{18.7.7}$$
This number fits perfectly within even a single precision 32-bit float. Thus, we should consider
the log-likelihood, which is

$$\log(P(X \mid \theta)). \tag{18.7.8}$$

Since the function $x \mapsto \log(x)$ is increasing, maximizing the likelihood is the same thing as
maximizing the log-likelihood. Indeed in Section 18.9 we will see this reasoning applied when working
with the specific example of the naive Bayes classifier.
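The underflow argument can be reproduced with nothing but the Python standard library. The sketch below assumes one billion independent examples, each with probability 1/2, as in the discussion above.

```python
import math

n = 1_000_000_000
prob = 0.5

# The raw product underflows to exactly 0.0 in double precision
product = prob ** n

# The logarithm of the same quantity is an ordinary, well-scaled number
log_product = n * math.log(prob)        # natural logarithm
log10_product = n * math.log10(prob)    # base-10 logarithm

print(product, log_product, log10_product)
```

Since every logarithm is a constant multiple of every other, the choice of base changes the reported value but never the location of the maximum.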


We often work with loss functions, where we wish to minimize the loss. We may turn maximum
likelihood into the minimization of a loss by taking $-\log(P(X \mid \theta))$, which is the negative
log-likelihood.


To illustrate this, consider the coin flipping problem from before, and pretend that we do not know
the closed form solution. We may compute that

$$-\log(P(X \mid \theta)) = -\log(\theta^{n_H}(1 - \theta)^{n_T}) = -(n_H \log(\theta) + n_T \log(1 - \theta)). \tag{18.7.9}$$

This can be written into code, and freely optimized even for billions of coin flips.
# Set up our data
n_H = 8675309
n_T = 25624

# Initialize our parameters
theta = np.array(0.5)
theta.attach_grad()

# Perform gradient descent
lr = 0.00000000001
for iter in range(10):
    with autograd.record():
        loss = -(n_H * np.log(theta) + n_T * np.log(1 - theta))
    loss.backward()
    theta -= lr * theta.grad

# Check output
theta, n_H / (n_H + n_T)


(array(0.50172704), 0.9970550284664874)
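In the run above, ten steps with that step size move theta only from 0.5 to about 0.5017, far from the closed-form answer. As a hedged variation (not the book's code), the sketch below uses plain Python with a hand-derived gradient, averages the loss over the number of flips so that the step size does not depend on the dataset size, and runs more iterations. The step size `0.001` and the iteration count are assumptions chosen for this particular dataset.

```python
n_H, n_T = 8675309, 25624
n = n_H + n_T

theta = 0.5
lr = 0.001  # assumed step size, small relative to the curvature near the optimum
for _ in range(2000):
    # Gradient of the averaged negative log-likelihood with respect to theta
    grad = -(n_H / theta - n_T / (1 - theta)) / n
    theta -= lr * grad

theta_mle = n_H / n
print(theta, theta_mle)
```

Dividing by `n` rescales the loss without moving its minimizer, which is one practical reason the average negative log-likelihood is usually what gets optimized.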


Numerical convenience is not the only reason people like to use negative log-likelihoods. There
are several other reasons why it can be preferable.


The second reason we consider the log-likelihood is the simplified application of calculus rules. As
discussed above, due to independence assumptions, most probabilities we encounter in machine
learning are products of individual probabilities.

$$P(X \mid \theta) = P(x_1 \mid \theta) \cdot P(x_2 \mid \theta) \cdots P(x_n \mid \theta). \tag{18.7.10}$$
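This means that if we directly apply the product rule to compute a derivative we get

$$\begin{aligned}
\frac{\partial}{\partial \theta} P(X \mid \theta) &= \left(\frac{\partial}{\partial \theta} P(x_1 \mid \theta)\right) \cdot P(x_2 \mid \theta) \cdots P(x_n \mid \theta) \\
&\quad + P(x_1 \mid \theta) \cdot \left(\frac{\partial}{\partial \theta} P(x_2 \mid \theta)\right) \cdots P(x_n \mid \theta) \\
&\quad \vdots \\
&\quad + P(x_1 \mid \theta) \cdot P(x_2 \mid \theta) \cdots \left(\frac{\partial}{\partial \theta} P(x_n \mid \theta)\right).
\end{aligned} \tag{18.7.11}$$

This requires n(n - 1) multiplications, along with (n - 1) additions, so it takes time quadratic
in the inputs! Sufficient cleverness in grouping terms will reduce this to linear time, but it requires
some thought. For the negative log-likelihood we have instead

$$-\log(P(X \mid \theta)) = -\log(P(x_1 \mid \theta)) - \log(P(x_2 \mid \theta)) \cdots - \log(P(x_n \mid \theta)), \tag{18.7.12}$$

which then gives

$$-\frac{\partial}{\partial \theta} \log(P(X \mid \theta)) = -\frac{1}{P(x_1 \mid \theta)}\left(\frac{\partial}{\partial \theta} P(x_1 \mid \theta)\right) - \cdots - \frac{1}{P(x_n \mid \theta)}\left(\frac{\partial}{\partial \theta} P(x_n \mid \theta)\right). \tag{18.7.13}$$

This requires only n divides and n - 1 sums, and thus is linear time in the inputs.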
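The two routes to the derivative can be compared on the coin-flip model, where each factor is either theta (heads) or 1 - theta (tails). The sketch below is illustrative only: it uses plain NumPy, the 13-flip sequence from earlier in this section, and an arbitrary evaluation point `theta = 0.6`.

```python
import numpy as np

theta = 0.6
flips = "HHHTHTTHHHHHT"
x = np.array([1 if c == "H" else 0 for c in flips])

probs = np.where(x == 1, theta, 1 - theta)   # P(x_i | theta) for each flip
dprobs = np.where(x == 1, 1.0, -1.0)         # derivative of each factor in theta

# Direct product rule: n terms, each a product of n factors (quadratic work)
grad_product_rule = 0.0
for i in range(len(x)):
    term = dprobs[i]
    for j in range(len(x)):
        if j != i:
            term *= probs[j]
    grad_product_rule += term

# Log-derivative route: one division per example and a single sum (linear work)
grad_log = np.sum(dprobs / probs)            # derivative of log P(X | theta)
grad_via_log = np.prod(probs) * grad_log     # chain rule back to P(X | theta)

print(grad_product_rule, grad_via_log)
```

Both computations return the same derivative of the likelihood; only the amount of arithmetic differs.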


The third and final reason to consider the negative log-likelihood is the relationship to information 
theory, which we will discuss in detail in Section 18.11. This is a rigorous mathematical theory 
which gives a way to measure the degree of information or randomness in a random variable. 
The key object of study in that field is the entropy which is 


$$-\sum_{i} p_i \log_2(p_i), \tag{18.7.14}$$


which measures the randomness of a source. Notice that this is nothing more than the average 
— log probability, and thus if we take our negative log-likelihood and divide by the number of data 
examples, we get a relative of entropy known as cross-entropy. This theoretical interpretation 
alone would be sufficiently compelling to motivate reporting the average negative log-likelihood 
over the dataset as a way of measuring model performance. 
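To make the connection concrete, the average negative log-likelihood of the 13 coin flips under a model can be compared with the cross-entropy between the empirical distribution of the flips and that model. This is a sketch with plain NumPy; the model parameter `theta = 0.6` is a hypothetical value chosen for illustration, and logarithms are taken base 2 to match (18.7.14).

```python
import numpy as np

n_H, n_T = 9, 4
n = n_H + n_T
theta = 0.6  # hypothetical model parameter

# Average negative log-likelihood (in bits) of the observed flips under the model
avg_nll = -(n_H * np.log2(theta) + n_T * np.log2(1 - theta)) / n

# Cross-entropy between the empirical distribution q and the model distribution p
q = np.array([n_H / n, n_T / n])
p = np.array([theta, 1 - theta])
cross_entropy = -np.sum(q * np.log2(p))

# Entropy of the empirical distribution, following (18.7.14)
entropy = -np.sum(q * np.log2(q))

print(avg_nll, cross_entropy, entropy)
```

The first two numbers coincide, and both are at least as large as the entropy of the empirical distribution, with equality only when the model matches the empirical frequencies.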


18.7.3 Maximum Likelihood for Continuous Variables 
Everything that we have done so far assumes we are working with discrete random variables, but 
what if we want to work with continuous ones? 


The short summary is that nothing at all changes, except we replace all the instances of the
probability with the probability density. Recalling that we write densities with lower case p, this means
that for example we now say

$$-\log(p(X \mid \theta)) = -\log(p(x_1 \mid \theta)) - \log(p(x_2 \mid \theta)) \cdots - \log(p(x_n \mid \theta)) = -\sum_i \log(p(x_i \mid \theta)). \tag{18.7.15}$$

The question becomes, "Why is this OK?" After all, the reason we introduced densities was because
the probability of getting any specific outcome was itself zero. Is the probability of generating
our data then not zero for every set of parameters?
Indeed, this is the case, and understanding why we can shift to densities is an exercise in tracing
what happens to the epsilons.


Let us first re-define our goal. Suppose that for continuous random variables we no longer want
to compute the probability of getting exactly the right value, but instead matching to within some
range $\epsilon$. For simplicity, we assume our data is repeated observations $x_1, \ldots, x_N$ of identically
distributed random variables $X_1, \ldots, X_N$. As we have seen previously, this can be written as

$$\begin{aligned}
&P(X_1 \in [x_1, x_1 + \epsilon], X_2 \in [x_2, x_2 + \epsilon], \ldots, X_N \in [x_N, x_N + \epsilon] \mid \theta) \\
\approx\; &\epsilon^N p(x_1 \mid \theta) \cdot p(x_2 \mid \theta) \cdots p(x_N \mid \theta).
\end{aligned} \tag{18.7.16}$$

Thus, if we take negative logarithms of this we obtain

$$\begin{aligned}
&-\log(P(X_1 \in [x_1, x_1 + \epsilon], X_2 \in [x_2, x_2 + \epsilon], \ldots, X_N \in [x_N, x_N + \epsilon] \mid \theta)) \\
\approx\; &-N\log(\epsilon) - \sum_i \log(p(x_i \mid \theta)).
\end{aligned} \tag{18.7.17}$$

If we examine this expression, the only place that the $\epsilon$ occurs is in the additive constant $-N\log(\epsilon)$.
This does not depend on the parameters $\theta$ at all, so the optimal choice of $\theta$ does not depend on
our choice of $\epsilon$! If we demand four digits or four-hundred, the best choice of $\theta$ remains the same,
thus we may freely drop the epsilon to see that what we want to optimize is

$$-\sum_i \log(p(x_i \mid \theta)). \tag{18.7.18}$$

Thus, we see that the maximum likelihood point of view can operate with continuous random
variables as easily as with discrete ones by replacing the probabilities with probability densities.
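The independence from epsilon can be observed numerically. The sketch below assumes a unit-variance Gaussian model with unknown mean, synthetic data, and a grid of candidate means; all of these are illustrative assumptions, and plain NumPy is used in place of the MXNet `np` module.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, 200)        # hypothetical observations
mus = np.linspace(0.0, 3.0, 3001)    # candidate values of the mean

def neg_log_density(mu):
    """Negative log of the unit-variance Gaussian density, summed over the data."""
    return np.sum(0.5 * (x - mu)**2 + 0.5 * np.log(2 * np.pi))

base = np.array([neg_log_density(mu) for mu in mus])

best = {}
for eps in (1e-1, 1e-4, 1e-8):
    # Approximate interval objective from (18.7.17): -N log(eps) plus the density term
    objective = -len(x) * np.log(eps) + base
    best[eps] = mus[np.argmin(objective)]

print(best, x.mean())
```

Changing `eps` shifts every value of the objective by the same constant, so the minimizing mean stays at the grid point closest to the sample mean.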


Summary 
* The maximum likelihood principle tells us that the best fit model for a given dataset is the 
one that generates the data with the highest probability. 


* Often people work with the negative log-likelihood instead for a variety of reasons: numerical
stability, conversion of products to sums (and the resulting simplification of gradient
computations), and theoretical ties to information theory.


* While simplest to motivate in the discrete setting, it may be freely generalized to the continuous
setting as well by maximizing the probability density assigned to the datapoints.


Exercises 


1. Suppose that you know that a random variable has density $\frac{1}{\alpha}e^{-\alpha x}$ for some value $\alpha$. You
obtain a single observation from the random variable which is the number 3. What is the
maximum likelihood estimate for $\alpha$?


2. Suppose that you have a dataset of samples $\{x_i\}_{i=1}^N$ drawn from a Gaussian with unknown
mean, but variance 1. What is the maximum likelihood estimate for the mean?


Discussions [247]





[247] https://discuss.d2l.ai/t/416


18.8 Distributions


Now that we have learned how to work with probability in both the discrete and the continuous 
setting, let us get to know some of the common distributions encountered. Depending on the area 
of machine learning, we may need to be familiar with vastly more of these, or for some areas of 
deep learning potentially none at all. This is, however, a good basic list to be familiar with. Let us 
first import some common libraries. 


%matplotlib inline 

from d2l import mxnet as d2l
from IPython import display 
from math import erf, factorial 
import numpy as np 


18.8.1 Bernoulli 


This is the simplest random variable usually encountered. This random variable encodes a coin 
flip which comes up 1 with probability p and 0 with probability 1 — p. If we have a random variable 
X with this distribution, we will write 


X ~ Bernoulli(p). (18.8.1) 


The cumulative distribution function is 


$$F(x) = \begin{cases} 0 & x < 0, \\ 1 - p & 0 \le x < 1, \\ 1 & x \ge 1. \end{cases} \tag{18.8.2}$$


The probability mass function is plotted below. 
p = 0.3

d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()


Now, let us plot the cumulative distribution function (18.8.2).

x = np.arange(-1, 2, 0.01)

def F(x):
    # Matches (18.8.2): the c.d.f. already equals 1 at x = 1
    return 0 if x < 0 else 1 if x >= 1 else 1 - p

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')
If X ~ Bernoulli(p), then:

* $\mu_X = p$,
* $\sigma_X^2 = p(1 - p)$.


We can sample an array of arbitrary shape from a Bernoulli random variable as follows. 


1*(np.random.rand(10, 10) < p)


18.8.2 Discrete Uniform


The next commonly encountered random variable is a discrete uniform. For our discussion here,
we will assume that it is supported on the integers {1, 2, ..., n}, however any other set of values
can be freely chosen. The meaning of the word uniform in this context is that every possible value
is equally likely. The probability for each value $i \in \{1, 2, 3, \ldots, n\}$ is $p_i = \frac{1}{n}$. We will denote a
random variable X with this distribution as

$$X \sim U(n). \tag{18.8.3}$$

The cumulative distribution function is

$$F(x) = \begin{cases} 0 & x < 1, \\ \frac{k}{n} & k \le x < k + 1 \text{ with } 1 \le k < n, \\ 1 & x \ge n. \end{cases} \tag{18.8.4}$$


Let us first plot the probability mass function. 

n = 5

d2l.plt.stem([i+1 for i in range(n)], n*[1 / n], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
Now, let us plot the cumulative distribution function (18.8.4).

x = np.arange(-1, 6, 0.01)

def F(x):
    return 0 if x < 1 else 1 if x > n else np.floor(x) / n

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')
If X ~ U(n), then:

* $\mu_X = \frac{1 + n}{2}$,
* $\sigma_X^2 = \frac{n^2 - 1}{12}$.


We can sample an array of arbitrary shape from a discrete uniform random variable as follows. 


# The upper bound of randint is exclusive, so pass n + 1 to include the value n
np.random.randint(1, n + 1, size=(10, 10))
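The stated mean and variance can be checked empirically. The following sketch uses NumPy's `Generator.integers`, whose upper bound is exclusive by default, so `n + 1` is passed to include the value n; the seed and the sample size are arbitrary choices.

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
samples = rng.integers(1, n + 1, size=100_000)  # values in {1, ..., n}

values, counts = np.unique(samples, return_counts=True)
emp_mean, emp_var = samples.mean(), samples.var()
theory_mean, theory_var = (1 + n) / 2, (n**2 - 1) / 12

print(values, counts / samples.size)
print(emp_mean, theory_mean, emp_var, theory_var)
```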
18.8.3 Continuous Uniform


Next, let us discuss the continuous uniform distribution. The idea behind this random variable
is that if we increase the n in the discrete uniform distribution, and then scale it to fit within the
interval [a, b], we will approach a continuous random variable that just picks an arbitrary value in
[a, b] all with equal probability. We will denote this distribution as

$$X \sim U(a, b). \tag{18.8.5}$$

The probability density function is

$$p(x) = \begin{cases} \frac{1}{b - a} & x \in [a, b], \\ 0 & x \notin [a, b]. \end{cases} \tag{18.8.6}$$

The cumulative distribution function is

$$F(x) = \begin{cases} 0 & x < a, \\ \frac{x - a}{b - a} & x \in [a, b], \\ 1 & x \ge b. \end{cases} \tag{18.8.7}$$


Let us first plot the probability density function (18.8.6). 
a, b = 1, 3

x = np.arange(0, 4, 0.01)
p = (x > a)*(x < b)/(b - a)

d2l.plot(x, p, 'x', 'p.d.f.')
Now, let us plot the cumulative distribution function (18.8.7).

def F(x):
    return 0 if x < a else 1 if x > b else (x - a) / (b - a)

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')
If X ~ U(a, b), then:

* $\mu_X = \frac{a + b}{2}$,
* $\sigma_X^2 = \frac{(b - a)^2}{12}$.


We can sample an array of arbitrary shape from a uniform random variable as follows. Note that
np.random.rand samples from U(0, 1) by default, so if we want a different range we need to scale it.


(b - a) * np.random.rand(10, 10) + a 











array([[1.67557501, 2.89605204, 2.34332438, 1.62353419, 1.66541822, 

2.87781563, 1.92747163, 2.09800617, 2.782816 , 2.91226822], 
[2.42536269, 1.5747761 , 1.31470515, 1.56244488, 1.86499132, 
1.07692837, 2.24743309, 2.69692693, 1.39804837, 2.65831216], 
[2.86108252, 1.72043635, 1.19161907, 1.0759084 , 2.96016755, 
2.8101376 , 1.97033213, 1.66967807, 2.53385283, 2.52028154], 
[1.0432746 , 1.29154623, 2.11471293, 1.52306307, 2.10547512, 
2.11739503, 1.13781719, 1.05407917, 1.82900499, 2.06417557], 
[1.46798281, 1.75550318, 2.23118697, 1.14063127, 1.32858305, 
2.35958101, 1.0954111 , 2.37208249, 1.34418126, 1.35545937], 
[1.29081533, 2.54844965, 2.62693091, 2.86451481, 1.84266316, 
1.49112427, 2.17715764, 2.16526244, 2.76717851, 2.79337097], 
[1.91151257, 2.69636591, 2.10064109, 2.44397493, 2.20145546, 
1.64616836, 2.92604195, 1.46960478, 2.17627248, 2.73407559], 
[2.01956931, 2.86908562, 1.09838225, 1.4672259 , 2.96316445, 
1.85724057, 2.72523628, 1.15218265, 1.81706566, 1.5432301 ], 
[2.25150714, 2.97641019, 1.80615349, 1.0506879 , 2.69772445, 
1.74565778, 1.31719152, 2.77733396, 2.94051497, 2.39506607], 
[1.0194495 , 1.39038889, 2.50354911, 1.69565606, 1.07312304, 
2.65441135, 2.59194674, 2.16550446, 1.50224748, 2.89927366]]) 


18.8.4 Binomial 


Let us make things a little more complex and examine the binomial random variable. This random
variable originates from performing a sequence of n independent experiments, each of which has
probability p of succeeding, and asking how many successes we expect to see.


Let us express this mathematically. Each experiment is an independent random variable $X_i$ where
we will use 1 to encode success, and 0 to encode failure. Since each is an independent coin flip
which is successful with probability p, we can say that $X_i \sim \mathrm{Bernoulli}(p)$. Then, the binomial
random variable is

$$X = \sum_{i=1}^n X_i. \tag{18.8.8}$$

In this case, we will write

$$X \sim \mathrm{Binomial}(n, p). \tag{18.8.9}$$

To get the cumulative distribution function, we need to notice that getting exactly k successes can
occur in $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ ways, each of which has a probability of $p^k(1-p)^{n-k}$ of occurring. Thus the
cumulative distribution function is

$$F(x) = \begin{cases} 0 & x < 0, \\ \sum_{m \le k} \binom{n}{m} p^m(1-p)^{n-m} & k \le x < k + 1 \text{ with } 0 \le k < n, \\ 1 & x \ge n. \end{cases} \tag{18.8.10}$$
Let us first plot the probability mass function.

n, p = 10, 0.2

# Compute binomial coefficient
def binom(n, k):
    comb = 1
    for i in range(min(k, n - k)):
        comb = comb * (n - i) // (i + 1)
    return comb

pmf = np.array([p**i * (1-p)**(n - i) * binom(n, i) for i in range(n + 1)])

d2l.plt.stem([i for i in range(n + 1)], pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
Now, let us plot the cumulative distribution function (18.8.10).

x = np.arange(-1, 11, 0.01)
cmf = np.cumsum(pmf)

def F(x):
    return 0 if x < 0 else 1 if x > n else cmf[int(x)]

d2l.plot(x, np.array([F(y) for y in x.tolist()]), 'x', 'c.d.f.')
While this result is not simple, the means and variances are. If X ~ Binomial(n, p), then:

* $\mu_X = np$,
* $\sigma_X^2 = np(1 - p)$.


This can be sampled as follows. 


np.random.binomial(n, p, size=(10, 10)) 
18.8.5 Poisson


Let us now perform a thought experiment. We are standing at a bus stop and we want to know how
many buses will arrive in the next minute. Let us start by considering $X^{(1)} \sim \mathrm{Bernoulli}(p)$ which
is simply the probability that a bus arrives in the one minute window. For bus stops far from an
urban center, this might be a pretty good approximation. We may never see more than one bus in
a minute.


However, if we are in a busy area, it is possible or even likely that two buses will arrive. We can
model this by splitting our random variable into two parts for the first 30 seconds, or the second
30 seconds. In this case we can write

$$X^{(2)} = X^{(2)}_1 + X^{(2)}_2, \tag{18.8.11}$$

where $X^{(2)}$ is the total sum, and $X^{(2)}_i \sim \mathrm{Bernoulli}(p/2)$. The total distribution is then
$X^{(2)} \sim \mathrm{Binomial}(2, p/2)$.







Why stop here? Let us continue to split that minute into n parts. By the same reasoning as above,
we see that

$$X^{(n)} \sim \mathrm{Binomial}(n, p/n). \tag{18.8.12}$$

Consider these random variables. By the previous section, we know that (18.8.12) has mean
$\mu_{X^{(n)}} = n(p/n) = p$, and variance $\sigma^2_{X^{(n)}} = n(p/n)(1 - (p/n)) = p(1 - p/n)$. If we take $n \to \infty$,
we can see that these numbers stabilize to $\mu_{X^{(\infty)}} = p$, and variance $\sigma^2_{X^{(\infty)}} = p$. This indicates that
there could be some random variable we can define in this infinite subdivision limit.


This should not come as too much of a surprise, since in the real world we can just count the 
number of bus arrivals, however it is nice to see that our mathematical model is well defined. 
This discussion can be made formal as the law of rare events. 
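The convergence can be observed directly by comparing the p.m.f. of Binomial(n, p/n) with the limiting p.m.f. $e^{-p}p^k/k!$, which is the Poisson p.m.f. with rate p. The sketch below uses only the standard library; the rate `0.8` and the helper names `binomial_pmf` and `poisson_pmf` are illustrative assumptions.

```python
from math import comb, exp, factorial

rate = 0.8  # hypothetical value of p, the expected number of buses per minute

def binomial_pmf(n, k, q):
    """Probability of exactly k successes in n trials with success probability q."""
    return comb(n, k) * q**k * (1 - q)**(n - k)

def poisson_pmf(k, lam):
    """Limiting probability of exactly k arrivals with rate lam."""
    return exp(-lam) * lam**k / factorial(k)

max_gap = {}
for n in (2, 10, 100, 10_000):
    # Largest difference over k = 0, 1, 2 between Binomial(n, p/n) and the limit
    max_gap[n] = max(abs(binomial_pmf(n, k, rate / n) - poisson_pmf(k, rate))
                     for k in range(3))

print(max_gap)
```

The gap shrinks as the minute is split into more pieces, in line with the law of rare events.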


Following through this reasoning carefully, we can arrive at the following model. We will say that
$X \sim \mathrm{Poisson}(\lambda)$ if it is a random variable which takes the values {0, 1, 2, ...} with probability

$$p_k = \frac{\lambda^k e^{-\lambda}}{k!}. \tag{18.8.13}$$

The value $\lambda > 0$ is known as the rate (or the shape parameter), and denotes the average number of
arrivals we expect in one unit of time.


We may sum this probability mass function to get the cumulative distribution function.

$$F(x) = \begin{cases} 0 & x < 0, \\ e^{-\lambda}\sum_{m=0}^{k} \frac{\lambda^m}{m!} & k \le x < k + 1 \text{ with } 0 \le k. \end{cases} \tag{18.8.14}$$

Let us first plot the probability mass function (18.8.13). 
lam = 5.0

xs = [i for i in range(20)]
pmf = np.array([np.exp(-lam) * lam**k / factorial(k) for k in xs])

d2l.plt.stem(xs, pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()




Now, let us plot the cumulative distribution function (18.8.14). 
x = np.arange(-1, 21, 0.01)
cmf = np.cumsum(pmf)

def F(x):
    # The p.m.f. above is truncated at k = 19, so clamp the index to the last entry
    return 0 if x < 0 else cmf[min(int(x), len(cmf) - 1)]

d2l.plot(x, np.array([F(y) for y in x.tolist()]), 'x', 'c.d.f.')
As we saw above, the means and variances are particularly concise. If X ~ Poisson($\lambda$), then:

* $\mu_X = \lambda$,
* $\sigma_X^2 = \lambda$.


This can be sampled as follows. 


np.random.poisson(lam, size=(10, 10)) 
18.8.6 Gaussian


Now let us try a different, but related experiment. Let us say we again are performing n independent
Bernoulli(p) measurements $X_i$. The distribution of the sum of these is $X^{(n)} \sim \mathrm{Binomial}(n, p)$.
Rather than taking a limit as n increases and p decreases, let us fix p, and then send $n \to \infty$. In
this case $\mu_{X^{(n)}} = np \to \infty$ and $\sigma^2_{X^{(n)}} = np(1 - p) \to \infty$, so there is no reason to think this limit
should be well defined.


However, not all hope is lost! Let us just make the mean and variance be well behaved by defining

$$Y^{(n)} = \frac{X^{(n)} - \mu_{X^{(n)}}}{\sigma_{X^{(n)}}}. \tag{18.8.15}$$

This can be seen to have mean zero and variance one, and so it is plausible to believe that it will
converge to some limiting distribution. If we plot what these distributions look like, we will become
even more convinced that it will work.


p = 0.2
ns = [1, 10, 100, 1000]
d2l.plt.figure(figsize=(10, 3))
for i in range(4):
    n = ns[i]
    pmf = np.array([p**i * (1-p)**(n-i) * binom(n, i) for i in range(n + 1)])
    d2l.plt.subplot(1, 4, i + 1)
    d2l.plt.stem([(i - n*p)/np.sqrt(n*p*(1 - p)) for i in range(n + 1)], pmf,
                 use_line_collection=True)
    d2l.plt.xlim([-4, 4])
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('p.m.f.')
    d2l.plt.title("n = {}".format(n))
d2l.plt.show()
One thing to note: compared to the Poisson case, we are now dividing by the standard deviation
which means that we are squeezing the possible outcomes into smaller and smaller areas. This is
an indication that our limit will no longer be discrete, but rather continuous.


A derivation of what occurs is beyond the scope of this document, but the central limit theorem
states that as $n \to \infty$, this will yield the Gaussian distribution (sometimes called the normal distribution).
More explicitly, for any a, b:

$$\lim_{n \to \infty} P(Y^{(n)} \in [a, b]) = P(\mathcal{N}(0, 1) \in [a, b]), \tag{18.8.16}$$
where we say a random variable is normally distributed with given mean $\mu$ and variance $\sigma^2$, written
$X \sim \mathcal{N}(\mu, \sigma^2)$, if X has density

$$p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}. \tag{18.8.17}$$


Let us first plot the probability density function (18.8.17). 





mu, sigma = 0, 1

x = np.arange(-3, 3, 0.01)
p = 1 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))

d2l.plot(x, p, 'x', 'p.d.f.')
Now, let us plot the cumulative distribution function. It is beyond the scope of this appendix, but
the Gaussian c.d.f. does not have a closed-form formula in terms of more elementary functions.
We will use erf which provides a way to compute this integral numerically.


def phi(x):
    return (1.0 + erf((x - mu) / (sigma * np.sqrt(2)))) / 2.0

d2l.plot(x, np.array([phi(y) for y in x.tolist()]), 'x', 'c.d.f.')
Keen-eyed readers will recognize some of these terms. Indeed, we encountered this integral in
Section 18.5, and we need exactly that computation to see that this $p_X(x)$ has total area one
and is thus a valid density.


Our choice of working with coin flips made computations shorter, but nothing about that choice
was fundamental. Indeed, if we take any collection of independent identically distributed random
variables $X_i$, and form

$$X^{(N)} = \sum_{i=1}^N X_i, \tag{18.8.18}$$

then

$$\frac{X^{(N)} - \mu_{X^{(N)}}}{\sigma_{X^{(N)}}} \tag{18.8.19}$$

will be approximately Gaussian. There are additional requirements needed to make it work, most
commonly $E[X^4] < \infty$, but the philosophy is clear.


The central limit theorem is the reason that the Gaussian is fundamental to probability, statis- 
tics, and machine learning. Whenever we can say that something we measured is a sum of many 
small independent contributions, we can assume that the thing being measured will be close to 
Gaussian. 
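The claim is not specific to coin flips, which can be illustrated by summing uniform random variables instead. The sketch below standardizes sums of 50 independent U(0, 1) draws and compares the empirical probability of landing in [-1, 1] with the Gaussian value computed through erf. It uses plain NumPy; the number of terms, the number of trials, and the seed are arbitrary assumptions.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
N, trials = 50, 100_000

# Sum N independent U(0, 1) variables, then standardize the sum
sums = rng.random((trials, N)).sum(axis=1)
mu_sum, sigma_sum = N * 0.5, sqrt(N / 12)   # mean and standard deviation of the sum
z = (sums - mu_sum) / sigma_sum

def phi(t):
    """Standard normal c.d.f. written with erf."""
    return (1.0 + erf(t / sqrt(2))) / 2.0

# Compare empirical and Gaussian probabilities of landing in [a, b]
a, b = -1.0, 1.0
empirical = np.mean((z >= a) & (z <= b))
gaussian = phi(b) - phi(a)

print(empirical, gaussian)
```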


There are many more fascinating properties of Gaussians, and we would like to discuss one more 
here. The Gaussian is what is known as a maximum entropy distribution. We will get into entropy 
more deeply in Section 18.11, however all we need to know at this point is that it is a measure of 
randomness. In a rigorous mathematical sense, we can think of the Gaussian as the most ran- 
dom choice of random variable with fixed mean and variance. Thus, if we know that our random 
variable has some mean and variance, the Gaussian is in a sense the most conservative choice of 
distribution we can make. 


To close the section, let us recall that if $X \sim \mathcal{N}(\mu, \sigma^2)$, then:

* $\mu_X = \mu$,
* $\sigma_X^2 = \sigma^2$.


We can sample from the Gaussian (or normal) distribution as shown below.


np.random.normal(mu, sigma, size=(10, 10))


array([[-0.99370064, -1.14167819, -0.85378714, 0.08260954, -1.80513541, 
0.74352804, 0.46967471, -0.73595535, -1.44984534, -0.24282168], 
[-0.08920022, 0.69443069, -1.36936071, 0.93891018, -1.21716262, 
-0.51426347, -2.25255033, -0.83285848, 0.21587898, -1.75980648], 
[-1.20717774, 0.78939294, 0.65862222, -0.28435577, 0.15900332, 
-0.9270579 , -0.43756911, -0.68422118, 1.74454088, 0.52224481], 
[ 2.20922863, -1.32050501, 0.93244402, 1.41943799, 0.19160707, 
-1.87137331, -0.16778267, -1.06562849, 0.607715 , 0.44044445], 
[-1.93255429, 0.8503131 , -0.75933828, 1.3127858 , 0.18607227, 
0.61615403, -0.44100139, 0.41650152, -1.35798346, -0.25412984], 

[ 0.88823065, -0.77266299, 0.67654889, 0.15340936, -0.07506811, 
1.54103543, 0.52341302, 1.94431951, 1.40757345, 0.80341086], 
       [-1.16062622, -3.03280828, -1.46631312, -2.02831708,  0.77889955,
         1.06202559, -1.59634417, -1.27254011, -0.86063411,  0.84594837],
       [ 0.07980832, -1.73573892, -0.75334411,  1.10355608,  0.36619175,
         0.9394234 , -0.56450753, -0.41730859,  1.53313022,  0.03305373],
       [ 0.90403421,  0.8908935 , -0.15442275,  1.17499628,  1.1279271 ,
        -1.26149299,  0.91933322, -0.92666666, -0.29127451,  0.13530692],
       [ 0.36000797, -1.30541204, -1.0091219 , -1.24899032,  0.77075423,
        -1.42665032,  1.50204235,  0.38065647,  0.14614544, -0.26688275]])


18.8.7 Exponential Family 


One shared property for all the distributions listed above is that they all belong to what is known
as the exponential family. The exponential family is a set of distributions whose density can be
expressed in the following form:

$$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \cdot \exp\left(\boldsymbol{\eta}^\top \cdot T(\mathbf{x}) - A(\boldsymbol{\eta})\right). \tag{18.8.20}$$


As this definition can be a little subtle, let us examine it closely. 


First, h(x) is known as the underlying measure or the base measure. This can be viewed as an original 
choice of measure we are modifying with our exponential weight. 


Second, we have the vector $\boldsymbol{\eta} = (\eta_1, \eta_2, \ldots, \eta_l) \in \mathbb{R}^l$,
called the natural parameters or canonical parameters. These define how the base measure will be
modified. The natural parameters enter the new measure by taking the dot product of these
parameters with some function $T(\cdot)$ of $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$
and exponentiating the result. $T(\mathbf{x}) = (T_1(\mathbf{x}), T_2(\mathbf{x}), \ldots, T_l(\mathbf{x}))$
is called the sufficient statistics for $\boldsymbol{\eta}$. This name is used since the information
represented by $T(\mathbf{x})$ is sufficient to calculate the probability density, and no other
information from the sample $\mathbf{x}$ is required.


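Third, we have $A(\boldsymbol{\eta})$, which is referred to as the cumulant function, which ensures
that the above distribution (18.8.20) integrates to one, i.e.,


$$A(\boldsymbol{\eta}) = \log \left[\int h(\mathbf{x}) \cdot \exp\left(\boldsymbol{\eta}^{\top} \cdot T(\mathbf{x})\right) d\mathbf{x} \right].$$ (18.8.21)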


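To be concrete, let us consider the Gaussian. Assuming that x is a univariate variable, we saw
that it had a density of


$$
\begin{aligned}
p(x \mid \mu, \sigma) &= \frac{1}{\sqrt{2 \pi \sigma^2}} \cdot \exp\left\{ \frac{-(x-\mu)^2}{2 \sigma^2} \right\} \\
&= \frac{1}{\sqrt{2 \pi}} \cdot \exp \left\{ \frac{\mu}{\sigma^2} x - \frac{1}{2 \sigma^2} x^2 - \left( \frac{1}{2 \sigma^2} \mu^2 + \log(\sigma) \right) \right\}.
\end{aligned}
$$ (18.8.22)


This matches the definition of the exponential family with:


* underlying measure: $h(x) = \frac{1}{\sqrt{2 \pi}}$,

* natural parameters: $\boldsymbol{\eta} = \begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} = \begin{bmatrix} \frac{\mu}{\sigma^2} \\ \frac{1}{2 \sigma^2} \end{bmatrix}$,

* sufficient statistics: $T(x) = \begin{bmatrix} x \\ -x^2 \end{bmatrix}$, and

* cumulant function: $A(\boldsymbol{\eta}) = \frac{1}{2 \sigma^2} \mu^2 + \log(\sigma) = \frac{\eta_1^2}{4 \eta_2} - \frac{1}{2}\log(2 \eta_2)$.


It is worth noting that the exact choice of each of the above terms is somewhat arbitrary. Indeed, the
important feature is that the distribution can be expressed in this form, not the exact form itself.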


As we allude to in Section 3.4.6, a widely used technique is to assume that the final output y follows 
an exponential family distribution. The exponential family is a common and powerful family of 
distributions encountered frequently in machine learning. 
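

As a quick sanity check of the Gaussian example, the sketch below evaluates the density both in
its usual form and in the exponential family form, with $h(x) = 1/\sqrt{2\pi}$,
$\boldsymbol{\eta} = (\mu/\sigma^2, 1/(2\sigma^2))$, $T(x) = (x, -x^2)$, and
$A(\boldsymbol{\eta}) = \eta_1^2/(4\eta_2) - \frac{1}{2}\log(2\eta_2)$. It uses only Python's
standard math module rather than the MXNet np interface used elsewhere in this book, and the
function names and test points are hypothetical, chosen purely for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density in its usual form."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def gaussian_expfam_pdf(x, mu, sigma):
    """The same density written as h(x) * exp(eta . T(x) - A(eta))."""
    h = 1 / math.sqrt(2 * math.pi)                  # underlying (base) measure
    eta = (mu / sigma ** 2, 1 / (2 * sigma ** 2))   # natural parameters
    T = (x, -x ** 2)                                # sufficient statistics
    # Cumulant function, expressed purely in terms of the natural parameters
    A = eta[0] ** 2 / (4 * eta[1]) - 0.5 * math.log(2 * eta[1])
    return h * math.exp(eta[0] * T[0] + eta[1] * T[1] - A)

# Hypothetical (x, mu, sigma) triples, used only to compare the two forms
for x, mu, sigma in [(0.3, 0.0, 1.0), (-1.2, 0.5, 2.0), (4.0, 3.0, 0.7)]:
    assert math.isclose(gaussian_pdf(x, mu, sigma),
                        gaussian_expfam_pdf(x, mu, sigma), rel_tol=1e-9)
```

Both functions agree up to floating-point error, which illustrates that the two expressions are
the same density written in different coordinates.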


Summary 


* Bernoulli random variables can be used to model events with a yes/no outcome.


* Discrete uniform distributions model selections from a finite set of possibilities.


* Continuous uniform distributions select from an interval.


* Binomial distributions model a series of Bernoulli random variables, and count the number
of successes.


* Poisson random variables model the arrival of rare events.


* Gaussian random variables model the result of adding a large number of independent random
variables together.


* All the above distributions belong to the exponential family.


Exercises 


1. What is the standard deviation of a random variable that is the difference $X - Y$ of two
independent binomial random variables $X, Y \sim \mathrm{Binomial}(16, 1/2)$?


2. If we take a Poisson random variable $X \sim \mathrm{Poisson}(\lambda)$ and consider
$(X - \lambda)/\sqrt{\lambda}$ as $\lambda \rightarrow \infty$, we can show that this becomes
approximately Gaussian. Why does this make sense?


3. What is the probability mass function for a sum of two discrete uniform random variables 
on n elements? 


Discussions


18.9 Naive Bayes 


Throughout the previous sections, we learned about the theory of probability and random vari- 
ables. To put this theory to work, let us introduce the naive Bayes classifier. This uses nothing but 
probabilistic fundamentals to allow us to perform classification of digits. 


Learning is all about making assumptions. If we want to classify a new data example that we have 
never seen before we have to make some assumptions about which data examples are similar to 
each other. The naive Bayes classifier, a popular and remarkably clear algorithm, assumes all 
features are independent from each other to simplify the computation. In this section, we will 
apply this model to recognize characters in images. 





2 https://discuss.d2l.ai/t/417


%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import gluon, np, npx

npx.set_np()
d2l.use_svg_display()


18.9.1 Optical Character Recognition 


MNIST (LeCun et al., 1998) is one of the most widely used datasets. It contains 60,000 images for
training and 10,000 images for validation. Each image contains a handwritten digit from 0 to 9. The
task is to classify each image into the corresponding digit.


Gluon provides a MNIST class in the data.vision module to automatically retrieve the dataset from
the Internet. Subsequently, Gluon will use the already-downloaded local copy. We specify whether
we are requesting the training set or the test set by setting the value of the parameter train to
True or False, respectively. Each image is a grayscale image with both width and height of 28, so
its shape is (28, 28, 1). We use a customized transformation to remove the last channel dimension.
In addition, the dataset represents each pixel by an unsigned 8-bit integer. We quantize them into
binary features to simplify the problem.


def transform(data, label):
    return np.floor(data.astype('float32') / 128).squeeze(axis=-1), label


mnist_train = gluon.data.vision.MNIST(train=True, transform=transform) 
mnist_test = gluon.data.vision.MNIST(train=False, transform=transform) 


We can access a particular example, which contains the image and the corresponding label. 


image, label = mnist_train[2] 
image.shape, label 


((28, 28), array(4, dtype=int32)) 


Our example, stored here in the variable image, corresponds to an image with a height and width 
of 28 pixels. 


image.shape, image.dtype


((28, 28), dtype('float32'))


Our code stores the label of each image as a scalar. Its type is a 32-bit integer.


label, type(label), label.dtype


(array(4, dtype=int32), mxnet.numpy.ndarray, dtype('int32'))


We can also access multiple examples at the same time.


images, labels = mnist_train[10:38]
images.shape, labels.shape


((28, 28, 28), (28,)) 


Let us visualize these examples. 


d2l.show_images(images, 2, 9);


18.9.2 The Probabilistic Model for Classification 






In a classification task, we map an example into a category. Here an example is a grayscale 28 × 28
image, and a category is a digit. (Refer to Section 3.4 for a more detailed explanation.) One natural
way to express the classification task is via the probabilistic question: what is the most likely label
given the features (i.e., image pixels)? Denote by $\mathbf{x} \in \mathbb{R}^d$ the features of the
example and $y \in \mathbb{R}$ the label. Here features are image pixels, where we can reshape a
2-dimensional image to a vector so that $d = 28^2 = 784$, and labels are digits. The probability of
the label given the features is $p(y \mid \mathbf{x})$. If we are able to compute these
probabilities, which are $p(y \mid \mathbf{x})$ for $y = 0, \ldots, 9$ in our example, then the
classifier will output the prediction $\hat{y}$ given by the expression:


$$\hat{y} = \mathrm{argmax} \> p(y \mid \mathbf{x}).$$ (18.9.1)


Unfortunately, this requires that we estimate $p(y \mid \mathbf{x})$ for every value of
$\mathbf{x} = x_1, \ldots, x_d$. Imagine that each feature could take one of 2 values. For example,
the feature $x_1 = 1$ might signify that the word apple appears in a given document and $x_1 = 0$
would signify that it does not. If we had 30 such binary features, that would mean that we need to
be prepared to classify any of $2^{30}$ (over 1 billion!) possible values of the input vector
$\mathbf{x}$.


Moreover, where is the learning? If we need to see every single possible example in order to predict 
the corresponding label then we are not really learning a pattern but just memorizing the dataset. 


18.9.3 The Naive Bayes Classifier 


Fortunately, by making some assumptions about conditional independence, we can introduce
some inductive bias and build a model capable of generalizing from a comparatively modest
selection of training examples. To begin, let us use Bayes' theorem to express the classifier as


$$\hat{y} = \mathrm{argmax}_y \> p(y \mid \mathbf{x}) = \mathrm{argmax}_y \> \frac{p(\mathbf{x} \mid y) p(y)}{p(\mathbf{x})}.$$ (18.9.2)


Note that the denominator is the normalizing term $p(\mathbf{x})$, which does not depend on the
value of the label y. As a result, we only need to worry about comparing the numerator across
different values of y. Even if calculating the denominator turned out to be intractable, we could
get away with ignoring it, so long as we could evaluate the numerator. Fortunately, even if we
wanted to recover the normalizing constant, we could, since


$$\sum_y p(y \mid \mathbf{x}) = 1.$$


Now, let us focus on p(x | y). Using the chain rule of probability, we can express the term p(x | y) 
as 


$$p(x_1 \mid y) \cdot p(x_2 \mid x_1, y) \cdot \ldots \cdot p(x_d \mid x_1, \ldots, x_{d-1}, y).$$ (18.9.3)


By itself, this expression does not get us any further. We still must estimate roughly $2^d$
parameters. However, if we assume that the features are conditionally independent of each other,
given the label, then suddenly we are in much better shape, as this term simplifies to
$\prod_i p(x_i \mid y)$, giving us the predictor


$$\hat{y} = \mathrm{argmax}_y \> \prod_{i=1}^d p(x_i \mid y) p(y).$$ (18.9.4)


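Suppose we estimate $p(x_i = 1 \mid y)$ for every i and y and save its value in $P_{xy}[i, y]$,
where $P_{xy}$ is a d × n matrix with n being the number of classes and $y \in \{1, \ldots, n\}$.
In addition, we estimate $p(y)$ for every y and save it in $P_y[y]$, with $P_y$ an n-length vector.
Then for any new example $\mathbf{x}$, writing $P_{xy}[x_i, y]$ for $p(x_i \mid y)$ (that is,
$P_{xy}[i, y]$ if $x_i = 1$ and $1 - P_{xy}[i, y]$ if $x_i = 0$), we can compute


$$\hat{y} = \mathrm{argmax}_y \> \prod_{i=1}^d P_{xy}[x_i, y] P_y[y],$$ (18.9.5)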


for any y. So our assumption of conditional independence has taken the complexity of our model
from an exponential dependence on the number of features $\mathcal{O}(2^d n)$ to a linear
dependence, which is $\mathcal{O}(dn)$.
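

To get a feeling for the size of this reduction, the short sketch below counts table entries for
binarized 28 × 28 images and 10 classes. The variable names are hypothetical, and the count for the
full table is a rough illustration of the exponential term rather than an exact parameter count.

```python
d = 28 * 28          # number of binary pixel features per image
n_classes = 10       # digits 0 through 9

# Without independence: one probability per (pixel pattern, class) pair
full_table_entries = (2 ** d) * n_classes
# With conditional independence: one p(x_i = 1 | y) per (pixel, class) pair
naive_bayes_entries = d * n_classes

print(f'naive Bayes table: {naive_bayes_entries} entries')
print(f'full table: a number with {len(str(full_table_entries))} decimal digits')
```

The naive Bayes table has 7,840 entries, whereas the number of entries in the full table is far
too large to ever store or estimate from data.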


18.9.4 Training 


The problem now is that we do not know $P_{xy}$ and $P_y$. So we need to estimate their values given
some training data first. This is training the model. Estimating $P_y$ is not too hard. Since we are
only dealing with 10 classes, we may count the number of occurrences $n_y$ for each of the digits
and divide it by the total amount of data n. For instance, if digit 8 occurs $n_8 = 5,800$ times and
we have a total of n = 60,000 images, the probability estimate is $p(y = 8) = 0.0967$.


X, Y = mnist_train[:]  # All training examples

n_y = np.zeros((10))
for y in range(10):
    n_y[y] = (Y == y).sum()
P_y = n_y / n_y.sum()
P_y


array([0.09871667, 0.11236667, 0.0993    , 0.10218333, 0.09736667,
       0.09035   , 0.09863333, 0.10441667, 0.09751666, 0.09915   ])


Now on to slightly more difficult things: $P_{xy}$. Since we picked black and white images,
$p(x_i \mid y)$ denotes the probability that pixel i is switched on for class y. Just like before
we can go and count the number of times $n_{iy}$ such that an event occurs and divide it by the
total number of occurrences of y, i.e., $n_y$. But there is something slightly troubling: certain
pixels may never be black (e.g., for well cropped images the corner pixels might always be white).
A convenient way for statisticians to deal with this problem is to add pseudo counts to all
occurrences. Hence, rather than $n_{iy}$ we use $n_{iy} + 1$ and instead of $n_y$ we use $n_y + 1$.
This is also called Laplace smoothing. It may seem ad hoc; however, it can be motivated from a
Bayesian point of view.

n_x = np.zeros((10, 28, 28))
for y in range(10):
    n_x[y] = np.array(X.asnumpy()[Y.asnumpy() == y].sum(axis=0))
P_xy = (n_x + 1) / (n_y + 1).reshape(10, 1, 1)

d2l.show_images(P_xy, 2, 5);


By visualizing these 10 × 28 × 28 probabilities (for each pixel for each class) we obtain digits
that look like per-class averages.
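

To see what the pseudo counts do, here is a minimal sketch on a hypothetical toy class with four
training examples of three binary pixels each (the data and variable names are made up for
illustration). It applies the rule from the text, $(n_{iy} + 1)/(n_y + 1)$, and, as an assumption
not taken from the text, also the common variant that adds the number of possible pixel values (2)
to the denominator.

```python
# Hypothetical toy data: 4 training examples of one class, 3 binary pixels each
X_class = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
]
n_y = len(X_class)

# n_iy: how often pixel i is switched on within this class
n_iy = [sum(row[i] for row in X_class) for i in range(3)]

raw = [c / n_y for c in n_iy]                           # no smoothing
smoothed_text_rule = [(c + 1) / (n_y + 1) for c in n_iy]  # rule used in the text
smoothed_add_two = [(c + 1) / (n_y + 2) for c in n_iy]    # common alternative

print(raw)
print(smoothed_text_rule)
print(smoothed_add_two)
```

Without smoothing, the third pixel gets probability zero, so a single switched-on pixel there would
rule out the class entirely. Both smoothed versions avoid this. Note that under the rule from the
text a pixel that is on in every example still gets probability one, while the alternative keeps
every estimate strictly between zero and one.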






Now we can use (18.9.5) to predict a new image. Given x, the following function computes
p(x | y)p(y) for every y.


def bayes_pred(x):
    x = np.expand_dims(x, axis=0)  # (28, 28) -> (1, 28, 28)
    p_xy = P_xy * x + (1 - P_xy)*(1 - x)
    p_xy = p_xy.reshape(10, -1).prod(axis=1)  # p(x|y)
    return np.array(p_xy) * P_y


image, label = mnist_test[0] 
bayes_pred(image) 


array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])


This went horribly wrong! To find out why, let us look at the per pixel probabilities. They are
typically numbers between 0.001 and 1. We are multiplying 784 of them. At this point it is worth
mentioning that we are calculating these numbers on a computer, hence with a fixed range for the
exponent. What happens is that we experience numerical underflow, i.e., multiplying all the small
numbers leads to something even smaller until it is rounded down to zero. We discussed this as a
theoretical issue in Section 18.7, but we see the phenomenon clearly here in practice.


As discussed in that section, we fix this by using the fact that log ab = log a + log b, i.e., we
switch to summing logarithms. Even if both a and b are small numbers, the logarithm values should
be in a proper range.


a = 0.1
print('underflow:', a**784)
print('logarithm is normal:', 784*math.log(a))


underflow: 0.0 
logarithm is normal: -1805.2267129073316 


Since the logarithm is an increasing function, we can rewrite (18.9.5) as


$$\hat{y} = \mathrm{argmax}_y \> \sum_{i=1}^d \log P_{xy}[x_i, y] + \log P_y[y].$$ (18.9.6)


We can implement the following stable version: 


log_P_xy = np.log(P_xy) 
log_P_xy_neg = np.log(1 - P_xy) 
log_P_y = np.log(P_y) 


def bayes_pred_stable(x):
    x = np.expand_dims(x, axis=0)  # (28, 28) -> (1, 28, 28)
    p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
    p_xy = p_xy.reshape(10, -1).sum(axis=1)  # log p(x|y)
    return p_xy + log_P_y


py = bayes_pred_stable(image) 
py 


array([-269.00424, -301.73447, -245.21458, -218.8941 , -193.46907, 
-206.10315, -292.54315, -114.62834, -220.35619, -163.18881]) 


We may now check if the prediction is correct.


# Convert label which is a scalar tensor of int32 dtype
# to a Python scalar integer for comparison
py.argmax(axis=0) == int(label)


array(True) 
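

The values returned by bayes_pred_stable are unnormalized log-scores, $\log p(\mathbf{x} \mid y) +
\log p(y)$. As noted earlier, the normalizing constant can be recovered because the posterior must
sum to one. The sketch below does this with the log-sum-exp trick, using the ten log-scores printed
above for the first test image. It relies only on Python's standard math module, and the helper
name log_softmax is our own choice, not part of the book's code.

```python
import math

# Log-scores log p(x | y) + log p(y), copied from the output shown above
log_scores = [-269.00424, -301.73447, -245.21458, -218.8941, -193.46907,
              -206.10315, -292.54315, -114.62834, -220.35619, -163.18881]

def log_softmax(scores):
    """Subtract log(sum(exp(scores))), computed stably by factoring out the max."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]

log_posterior = log_softmax(log_scores)          # log p(y | x)
posterior = [math.exp(lp) for lp in log_posterior]
predicted = max(range(10), key=lambda k: posterior[k])
print(predicted, sum(posterior))
```

Subtracting the maximum before exponentiating keeps the largest term at exp(0) = 1, so the sum
cannot underflow to zero even though every individual score is far below the range in which exp
is representable.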


If we now predict a few validation examples, we can see the Bayes classifier works pretty well. 


def predict(X):
    return [bayes_pred_stable(x).argmax(axis=0).astype(np.int32) for x in X]

X, y = mnist_test[:18]
preds = predict(X)
d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);


[Figure: the 18 test images in a 2 × 9 grid, titled with their predicted digits:
7 2 1 0 4 1 4 9 4 / 9 0 6 9 0 1 3 9 7]


Finally, let us compute the overall accuracy of the classifier. 






X, y = mnist_test[:] 
preds = np.array(predict(X), dtype=np.int32) 
float((preds == y).sum()) / len(y) # Validation accuracy 


0.8426 


Modern deep networks achieve error rates of less than 0.01. By comparison, our classifier's error
rate is about 0.16. The relatively poor performance is due to the incorrect statistical assumptions
that we made in our model: we assumed that each and every pixel is independently generated,
depending only on the label. This is clearly not how humans write digits, and this wrong assumption
led to the downfall of our overly naive (Bayes) classifier.


Summary 


* Using Bayes' rule, a classifier can be made by assuming all observed features are independent
given the label.


* This classifier can be trained on a dataset by counting the number of occurrences of
combinations of labels and pixel values.


* This classifier was the gold standard for decades for tasks such as spam detection.


Exercises 


1. Consider the dataset [[0, 0], [0, 1], [1, 0], [1, 1]] with labels given by the XOR of the two
elements, [0, 1, 1, 0]. What are the probabilities for a naive Bayes classifier built on this
dataset? Does it successfully classify our points? If not, what assumptions are violated?


2. Suppose that we did not use Laplace smoothing when estimating probabilities and a data 
example arrived at testing time which contained a value never observed in training. What 
would the model output? 


3. The naive Bayes classifier is a specific example of a Bayesian network, where the dependence
of random variables is encoded with a graph structure. While the full theory is beyond the
scope of this section (see (Koller & Friedman, 2009) for full details), explain why allowing
explicit dependence between the two input variables in the XOR model allows for the creation
of a successful classifier.


Discussions


18.10 Statistics 


Undoubtedly, to be a top deep learning practitioner, the ability to train state-of-the-art,
highly accurate models is crucial. However, it is often unclear when improvements are significant,
or only the result of random fluctuations in the training process. To be able to discuss uncertainty
in estimated values, we must learn some statistics.


The earliest reference to statistics can be traced back to the Arab scholar Al-Kindi in the 9th
century, who gave a detailed description of how to use statistics and frequency analysis to
decipher encrypted messages. After 800 years, modern statistics arose in Germany in the 1700s, when
researchers focused on demographic and economic data collection and analysis. Today, statistics is
the science that concerns the collection, processing, analysis, interpretation, and visualization
of data. What is more, the core theory of statistics has been widely used in research within
academia, industry, and government.


More specifically, statistics can be divided into descriptive statistics and statistical inference.
The former focuses on summarizing and illustrating the features of a collection of observed data,
which is referred to as a sample. The sample is drawn from a population, which denotes the total
set of similar individuals, items, or events of interest to our experiment. In contrast to
descriptive statistics, statistical inference further deduces the characteristics of a population
from the given samples, based on the assumption that the sample distribution can replicate the
population distribution to some degree.


You may wonder: “What is the essential difference between machine learning and statistics?”
Fundamentally speaking, statistics focuses on the inference problem. This type of problem includes
modeling the relationship between variables, such as causal inference, and testing the statistical
significance of model parameters, such as A/B testing. In contrast, machine learning emphasizes
making accurate predictions, without explicitly programming and understanding each parameter's
functionality.


In this section, we will introduce three types of statistical inference methods: evaluating and
comparing estimators, conducting hypothesis tests, and constructing confidence intervals. These
methods can help us infer the characteristics of a given population, i.e., the true parameter
$\theta$. For brevity, we assume that the true parameter $\theta$ of a given population is a scalar
value. It is straightforward to extend to the case where $\theta$ is a vector or a tensor, thus we
omit it in our discussion.


18.10.1 Evaluating and Comparing Estimators 


In statistics, an estimator is a function of given samples used to estimate the true parameter
$\theta$. We will write $\hat{\theta}_n = f(x_1, \ldots, x_n)$ for the estimate of $\theta$ after
observing the samples $\{x_1, x_2, \ldots, x_n\}$.


We have seen simple examples of estimators before in Section 18.7. If you have a number
of samples from a Bernoulli random variable, then the maximum likelihood estimate for the
probability the random variable is one can be obtained by counting the number of ones observed 
and dividing by the total number of samples. Similarly, an exercise asked you to show that the 
maximum likelihood estimate of the mean of a Gaussian given a number of samples is given by 
the average value of all the samples. These estimators will almost never give the true value of the 
parameter, but ideally for a large number of samples the estimate will be close. 





22 https://discuss.d2l.ai/t/418


As an example, we show below the true density of a Gaussian random variable with mean zero and
variance one, along with a collection of samples from that Gaussian. We constructed the y coordinate
so every point is visible and the relationship to the original density is clearer.


from d2l import mxnet as d2l
from mxnet import np, npx
import random
npx.set_np()

# Sample datapoints and create y coordinate
epsilon = 0.1
# Note: this seeds Python's random module only, not the np.random call below
random.seed(8675309)
xs = np.random.normal(loc=0, scale=1, size=(300,))

ys = [np.sum(np.exp(-(xs[:i] - xs[i])**2 / (2 * epsilon**2))
             / np.sqrt(2*np.pi*epsilon**2)) / len(xs) for i in range(len(xs))]

# Compute true density
xd = np.arange(np.min(xs), np.max(xs), 0.01)
yd = np.exp(-xd**2/2) / np.sqrt(2 * np.pi)

# Plot the results
d2l.plot(xd, yd, 'x', 'density')
d2l.plt.scatter(xs, ys)
d2l.plt.axvline(x=0)
d2l.plt.axvline(x=np.mean(xs), linestyle='--', color='purple')
d2l.plt.title(f'sample mean: {float(np.mean(xs)):.2f}')
d2l.plt.show()


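[Figure: the true density curve with the 300 samples scattered beneath it; x-axis: x, y-axis:
density; title: sample mean: 0.05]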





There can be many ways to compute an estimator $\hat{\theta}_n$ of a parameter. In this section, we
introduce three common methods to evaluate and compare estimators: the mean squared error, the
standard deviation, and statistical bias.


Mean Squared Error


Perhaps the simplest metric used to evaluate estimators is the mean squared error (MSE) (or $l_2$
loss) of an estimator, which can be defined as


$$\mathrm{MSE}(\hat{\theta}_n, \theta) = E[(\hat{\theta}_n - \theta)^2].$$ (18.10.1)


This allows us to quantify the average squared deviation from the true value. MSE is always
non-negative. If you have read Section 3.1, you will recognize it as the most commonly used
regression loss function. As a measure to evaluate an estimator, the closer its value is to zero,
the closer the estimator is to the true parameter $\theta$.


Statistical Bias 


The MSE provides a natural metric, but we can easily imagine multiple different phenomena that
might make it large. Two fundamentally important ones are fluctuation in the estimator due to
randomness in the dataset, and systematic error in the estimator due to the estimation procedure.


First, let us measure the systematic error. For an estimator $\hat{\theta}_n$, the statistical bias
can be defined as


$$\mathrm{bias}(\hat{\theta}_n) = E(\hat{\theta}_n - \theta) = E(\hat{\theta}_n) - \theta.$$ (18.10.2)


Note that when $\mathrm{bias}(\hat{\theta}_n) = 0$, the expectation of the estimator
$\hat{\theta}_n$ is equal to the true value of the parameter. In this case, we say $\hat{\theta}_n$
is an unbiased estimator. All else being equal, an unbiased estimator is preferable to a biased
estimator since its expectation is the same as the true parameter.


It is worth being aware, however, that biased estimators are frequently used in practice. There are
cases where unbiased estimators do not exist without further assumptions, or are intractable to
compute. This may seem like a significant flaw in an estimator; however, the majority of estimators
encountered in practice are at least asymptotically unbiased in the sense that the bias tends to zero
as the number of available samples tends to infinity: $\lim_{n \rightarrow \infty} \mathrm{bias}(\hat{\theta}_n) = 0$.
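

A classic illustration of a biased but asymptotically unbiased estimator is the variance estimate
that divides by n instead of n - 1. The simulation below is a minimal sketch using Python's
standard random and statistics modules rather than the MXNet np interface. The true variance,
sample size, and number of trials are hypothetical choices made only for this illustration.

```python
import random
import statistics

random.seed(0)
true_var = 4.0            # variance of N(0, 2^2); hypothetical choice
n, trials = 5, 10000      # small samples make the bias easy to see

biased, unbiased = [], []
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    biased.append(statistics.pvariance(xs))    # divides by n
    unbiased.append(statistics.variance(xs))   # divides by n - 1

# Empirical bias: average estimate minus the true parameter
bias_biased = statistics.fmean(biased) - true_var
bias_unbiased = statistics.fmean(unbiased) - true_var
print(f'divide by n:     bias is roughly {bias_biased:.3f}')
print(f'divide by n - 1: bias is roughly {bias_unbiased:.3f}')
```

The divide-by-n estimator has expectation $(n-1)\sigma^2/n$, so its bias is $-\sigma^2/n$, which
is -0.8 here and shrinks to zero as n grows. The divide-by-(n - 1) version shows an empirical bias
close to zero.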


Variance and Standard Deviation 


Second, let us measure the randomness in the estimator. Recall from Section 18.6, the standard
deviation (or standard error) is defined as the square root of the variance. We may measure the
degree of fluctuation of an estimator by measuring the standard deviation or variance of that
estimator.


$$\sigma_{\hat{\theta}_n} = \sqrt{\mathrm{Var}(\hat{\theta}_n)} = \sqrt{E[(\hat{\theta}_n - E(\hat{\theta}_n))^2]}.$$ (18.10.3)


It is important to compare (18.10.3) to (18.10.1). In this equation we do not compare to the true
population value $\theta$, but instead to $E(\hat{\theta}_n)$, the expected value of the estimator.
Thus we are not measuring how far the estimator tends to be from the true value, but instead we are
measuring the fluctuation of the estimator itself.


The Bias-Variance Trade-off


It is intuitively clear that these two main components contribute to the mean squared error. What 
is somewhat shocking is that we can show that this is actually a decomposition of the mean squared 
error into these two contributions plus a third one. That is to say that we can write the mean 
squared error as the sum of the square of the bias, the variance and the irreducible error. 


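$$
\begin{aligned}
\mathrm{MSE}(\hat{\theta}_n, \theta) &= E[(\hat{\theta}_n - \theta)^2] \\
&= E[(\hat{\theta}_n)^2] + E[\theta^2] - 2E[\hat{\theta}_n \theta] \\
&= \mathrm{Var}[\hat{\theta}_n] + E[\hat{\theta}_n]^2 + \mathrm{Var}[\theta] + E[\theta]^2 - 2E[\hat{\theta}_n]E[\theta] \\
&= (E[\hat{\theta}_n] - E[\theta])^2 + \mathrm{Var}[\hat{\theta}_n] + \mathrm{Var}[\theta] \\
&= (E[\hat{\theta}_n - \theta])^2 + \mathrm{Var}[\hat{\theta}_n] + \mathrm{Var}[\theta] \\
&= (\mathrm{bias}[\hat{\theta}_n])^2 + \mathrm{Var}(\hat{\theta}_n) + \mathrm{Var}[\theta].
\end{aligned}
$$ (18.10.4)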


We refer to the above formula as the bias-variance trade-off. The mean squared error can be divided
into three sources of error: the error from high bias, the error from high variance, and the
irreducible error. The bias error is commonly seen in a simple model (such as a linear regression
model), which cannot extract high-dimensional relations between the features and the outputs. If a
model suffers from high bias error, we often say it is underfitting or lacks flexibility, as
introduced in Section 4.4. High variance usually results from an overly complex model, which
overfits the training data. As a result, an overfitting model is sensitive to small fluctuations in
the data. If a model suffers from high variance, we often say it is overfitting and lacks
generalization, as introduced in Section 4.4. The irreducible error results from noise in $\theta$
itself. Note that the third step of (18.10.4) uses $E[\hat{\theta}_n \theta] = E[\hat{\theta}_n]E[\theta]$,
which holds when the estimator and $\theta$ are uncorrelated. When $\theta$ is a fixed constant,
$\mathrm{Var}[\theta] = 0$ and only the bias and variance terms remain.


Evaluating Estimators in Code 


Since the standard deviation of an estimator can be computed by simply calling a.std() for
a tensor a, we will skip it but implement the statistical bias and the mean squared error.


# Statistical bias
def stat_bias(true_theta, est_theta):
    return(np.mean(est_theta) - true_theta)

# Mean squared error
def mse(data, true_theta):
    return(np.mean(np.square(data - true_theta)))


To illustrate the equation of the bias-variance trade-off, let us simulate a normal distribution
$\mathcal{N}(\theta, \sigma^2)$ with 10,000 samples. Here, we use $\theta = 1$ and $\sigma = 4$. As
the estimator is a function of the given samples, here we use the mean of the samples as an
estimator for the true $\theta$ in this normal distribution $\mathcal{N}(\theta, \sigma^2)$.


theta_true = 1
sigma = 4
sample_len = 10000
samples = np.random.normal(theta_true, sigma, sample_len)
theta_est = np.mean(samples)
theta_est


array(0.9503336)


Let us validate the trade-off equation by calculating the summation of the squared bias and the 
variance of our estimator. First, calculate the MSE of our estimator. 


mse(samples, theta_true) 


array(15.781996) 


Next, we calculate $\mathrm{Var}(\hat{\theta}_n) + [\mathrm{bias}(\hat{\theta}_n)]^2$ as below. As
you can see, the two values agree to numerical precision.


bias = stat_bias(theta_true, theta_est) 
np.square(samples.std()) + np.square(bias) 


array(15.781995) 
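

The check above works with the values of a single sample. As a complementary sketch, we can also
repeat the experiment many times and treat each repetition's estimate as one draw of the estimator
$\hat{\theta}_n$, which is the setting of (18.10.4) with a fixed $\theta$ (so that
$\mathrm{Var}[\theta] = 0$). The code below uses Python's standard random and statistics modules
instead of the MXNet np interface. The shrunk_mean estimator, the sample size, and the number of
trials are hypothetical choices, picked so that the bias is visibly nonzero.

```python
import random
import statistics

random.seed(1)
theta_true, sigma = 1.0, 4.0     # same values as in the text
n, trials = 50, 2000             # hypothetical simulation sizes

def shrunk_mean(xs, shrink=0.9):
    """A deliberately biased estimator: the sample mean scaled toward zero."""
    return shrink * statistics.fmean(xs)

# One estimate per simulated dataset
estimates = [
    shrunk_mean([random.gauss(theta_true, sigma) for _ in range(n)])
    for _ in range(trials)
]

mse_est = statistics.fmean([(e - theta_true) ** 2 for e in estimates])
bias_est = statistics.fmean(estimates) - theta_true
var_est = statistics.pvariance(estimates)

print(f'MSE:               {mse_est:.6f}')
print(f'bias^2 + variance: {bias_est ** 2 + var_est:.6f}')
```

The two printed numbers coincide up to floating-point error, and the empirical bias is close to
$(0.9 - 1)\theta = -0.1$, so both the bias term and the variance term contribute to the mean
squared error of this estimator.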


18.10.2 Conducting Hypothesis Tests 


The most commonly encountered topic in statistical inference is hypothesis testing. While
hypothesis testing was popularized in the early 20th century, the first use can be traced back to
John Arbuthnot in the 1700s. Arbuthnot tracked 80 years of birth records in London and concluded
that more men were born than women each year. Following that, modern significance testing is the
intellectual heritage of Karl Pearson, who invented the p-value and Pearson's chi-squared test;
William Gosset, who is the father of Student's t-distribution; and Ronald Fisher, who introduced
the null hypothesis and the significance test.


A hypothesis test is a way of evaluating some evidence against the default statement about a
population. We refer to the default statement as the null hypothesis $H_0$, which we try to reject
using the observed data. Here, we use $H_0$ as a starting point for the statistical significance
testing. The alternative hypothesis $H_A$ (or $H_1$) is a statement that is contrary to the null
hypothesis. A null hypothesis is often stated in a declarative form which posits a relationship
between variables. It should be stated as explicitly as possible and be testable by statistical
theory.


Imagine you are a chemist. After spending thousands of hours in the lab, you develop a new 
medicine which can dramatically improve one's ability to understand math. To show its magic 
power, you need to test it. Naturally, you may need some volunteers to take the medicine and see 
whether it can help them learn math better. How do you get started? 


First, you will need two carefully, randomly selected groups of volunteers, so that there is no
difference between their math understanding ability as measured by some metrics. The two groups are
commonly referred to as the test group and the control group. The test group (or treatment group)
is a group of individuals who will experience the medicine, while the control group represents the
group of users who are set aside as a benchmark, i.e., identical environment setups except taking
this medicine. In this way, the influence of all the variables is minimized, except the impact of
the independent variable in the treatment.


Second, after a period of taking the medicine, you will need to measure the two groups' math
understanding by the same metrics, such as letting the volunteers do the same tests after learning
a new math formula. Then, you can collect their performance and compare the results. In this case,
our null hypothesis will be that there is no difference between the two groups, and our alternative
will be that there is.


This is still not fully formal. There are many details you have to think of carefully. For example,
what are suitable metrics to test their math understanding ability? How many volunteers do you need
for your test so that you can confidently claim the effectiveness of your medicine? How long should
you run the test? How do you decide if there is a difference between the two groups? Do you care
about the average performance only, or also the range of variation of the scores? And so on.


In this way, hypothesis testing provides a framework for experimental design and reasoning about 
certainty in observed results. If we can now show that the null hypothesis is very unlikely to be 
true, we may reject it with confidence. 


To complete the story of how to work with hypothesis testing, we need to now introduce some 
additional terminology and make some of our concepts above formal. 


Statistical Significance 


Statistical significance is defined in terms of the probability of erroneously rejecting the null
hypothesis, $H_0$, when it should not be rejected, i.e.,


$$\text{statistical significance} = 1 - \alpha = 1 - P(\text{reject } H_0 \mid H_0 \text{ is true}).$$ (18.10.5)


Erroneously rejecting a true null hypothesis is also referred to as a type I error or false
positive. Its probability $\alpha$ is called the significance level, and its commonly used value is
5%, i.e., $1 - \alpha = 95\%$. The significance level can be explained as the level of risk that we
are willing to take of rejecting a true null hypothesis.
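

To make the meaning of the significance level concrete, the following sketch repeatedly draws data
for which the null hypothesis is true and counts how often a two-sided z-test at the 5% level
rejects it anyway. It uses Python's standard random and statistics modules. The choice of a z-test
with known unit variance, the sample size, and the number of trials are all assumptions made for
this illustration.

```python
import random
import statistics

random.seed(2)
alpha = 0.05
# Two-sided critical value of the standard normal, about 1.96
z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)
n, trials = 30, 4000

false_rejections = 0
for _ in range(trials):
    # Data generated under H0: mean 0, known standard deviation 1
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = statistics.fmean(xs) * n ** 0.5      # z-statistic for H0: mean = 0
    if abs(z) > z_crit:
        false_rejections += 1                # a type I error

type_1_rate = false_rejections / trials
print(f'empirical type I error rate: {type_1_rate:.3f}')
```

The empirical rejection rate comes out close to 0.05, matching the interpretation of $\alpha$ as
the risk of rejecting a null hypothesis that is in fact true.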


Fig. 18.10.1 shows the observations' values and probability of a given normal distribution in a
two-sample hypothesis test. If the observed data example is located outside the 95% threshold, it
will be a very unlikely observation under the null hypothesis assumption. Hence, there might be
something wrong with the null hypothesis and we will reject it.


[Figure: bell-shaped distribution of observations under the null hypothesis. x-axis: value of
observations; y-axis: probability of observations. The central region is the 95% confidence
interval; the tails beyond the 97.5% significance threshold (1 - α/2) contain very unusual
observations, such as an outlier with p-value < 5%.]


Fig. 18.10.1: Statistical significance.


Statistical Power


The statistical power (or sensitivity) measures the probability of rejecting the null hypothesis,
$H_0$, when it should be rejected, i.e.,


$$\text{statistical power} = 1 - \beta = 1 - P(\text{fail to reject } H_0 \mid H_0 \text{ is false}).$$ (18.10.6)


Recall that a type I error is an error caused by rejecting the null hypothesis when it is true,
whereas a type II error results from failing to reject the null hypothesis when it is false. The
probability of a type II error is usually denoted as $\beta$, and hence the corresponding
statistical power is $1 - \beta$.


Intuitively, statistical power can be interpreted as how likely our test will detect a real discrepancy 
of some minimum magnitude at a desired statistical significance level. 80% is a commonly used 
statistical power threshold. The higher the statistical power, the more likely we are to detect true 
differences. 


One of the most common uses of statistical power is in determining the number of samples
needed. The probability that you reject the null hypothesis when it is false depends on the degree
to which it is false (known as the effect size) and the number of samples you have. As you might
expect, small effect sizes will require a very large number of samples to be detectable with high
probability. While the detailed derivation is beyond the scope of this brief appendix, here is an
example: suppose we want to reject the null hypothesis that our sample came from a mean zero,
variance one Gaussian. If we believe that our sample's mean is actually close to one, we can do so with
acceptable error rates with a sample size of only 8. However, if we think the true mean of our sample
population is close to 0.01, then we would need a sample size of nearly 80000 to detect the difference.
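
To make these sample sizes concrete, here is a minimal sketch based on the standard normal approximation for a two-sided z-test, $n \approx ((z_{1-\alpha/2} + z_{1-\beta}) / \delta)^2$, where $\delta$ is the effect size. The function name `required_samples` and the defaults $\alpha = 0.05$ and power $0.8$ are our own assumptions for illustration; the text does not state which error rates it regards as acceptable.

```python
import math
from statistics import NormalDist


def required_samples(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size for a two-sided z-test of a mean shift.

    Assumes unit-variance Gaussian data and the normal approximation
    n = ((z_{1 - alpha/2} + z_{power}) / effect_size) ** 2.
    """
    z = NormalDist().inv_cdf  # quantile function of the standard normal
    n = ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2
    return math.ceil(n)


print(required_samples(1.0))   # true mean close to 1
print(required_samples(0.01))  # true mean close to 0.01
```

Under these assumptions the sketch gives 8 samples for an effect size of 1 and about 78,500 for an effect size of 0.01, in line with the figures quoted above.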


We can imagine the power as a water filter. In this analogy, a high power hypothesis test is like a
high quality water filtration system that will reduce harmful substances in the water as much as
possible. On the other hand, with a low quality water filter, some relatively small substances may
easily escape through the gaps; these play the role of smaller discrepancies. Similarly, if the
statistical power is not high enough, then the test may not catch a smaller discrepancy.


Test Statistic 


A test statistic T(x) is a scalar which summarizes some characteristic of the sample data. The goal 
of defining such a statistic is that it should allow us to distinguish between different distributions 
and conduct our hypothesis test. Thinking back to our chemist example, if we wish to show that 
one population performs better than the other, it could be reasonable to take the mean as the 
test statistic. Different choices of test statistic can lead to statistical tests with drastically different
statistical power.


Often, $T(X)$ (the test statistic viewed as a random variable under our null hypothesis) will follow, at
least approximately, a common probability distribution such as a normal distribution. If we can
derive explicitly such a distribution, and then measure our test statistic on our dataset, we can
safely reject the null hypothesis if our statistic is far outside the range that we would expect.
Making this quantitative leads us to the notion of p-values.


p-value


The p-value (or the probability value) is the probability that $T(X)$ is at least as extreme as the
observed test statistic $T(x)$ assuming that the null hypothesis is true, i.e.,


p-value $= P_{H_0}(T(X) \geq T(x))$. (18.10.7)


If the p-value is smaller than or equal to a predefined and fixed statistical significance level $\alpha$, we
may reject the null hypothesis. Otherwise, we will conclude that we lack evidence to reject
the null hypothesis. For a given population distribution, the region of rejection will be the set
of all points which have a p-value smaller than the statistical significance level $\alpha$.


One-sided Test and Two-sided Test


Normally there are two kinds of significance test: the one-sided test and the two-sided test. The
one-sided test (or one-tailed test) is applicable when the null hypothesis and the alternative
hypothesis only have one direction. For example, the null hypothesis may state that the true parameter
$\theta$ is less than or equal to a value $c$. The alternative hypothesis would be that $\theta$ is greater than $c$.
That is, the region of rejection is on only one side of the sampling distribution. Contrary to the
one-sided test, the two-sided test (or two-tailed test) is applicable when the region of rejection is on
both sides of the sampling distribution. An example in this case may have a null hypothesis stating
that the true parameter $\theta$ is equal to a value $c$. The alternative hypothesis would be that $\theta$ is not
equal to $c$.
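
The difference between the two kinds of test shows up directly in the p-value. The sketch below assumes, purely for illustration, that the test statistic follows a standard normal distribution under the null hypothesis; the function name `z_test_p_value` and the observed values are hypothetical.

```python
from statistics import NormalDist


def z_test_p_value(t_obs, two_sided=True):
    """p-value of an observed statistic t_obs when T(X) ~ N(0, 1) under H_0."""
    phi = NormalDist().cdf
    if two_sided:
        # Both tails count as "at least as extreme": P(|T| >= |t_obs|).
        return 2 * (1 - phi(abs(t_obs)))
    # Upper-tailed test (H_0: theta <= c): P(T >= t_obs).
    return 1 - phi(t_obs)


alpha = 0.05
for t_obs in (1.7, 2.5):
    p_one = z_test_p_value(t_obs, two_sided=False)
    p_two = z_test_p_value(t_obs, two_sided=True)
    print(f"T(x) = {t_obs}: one-sided p = {p_one:.4f} (reject: {p_one <= alpha}), "
          f"two-sided p = {p_two:.4f} (reject: {p_two <= alpha})")
```

For an observed statistic of 1.7, the one-sided test rejects at the 5% level while the two-sided test does not, because the two-sided test splits its rejection region across both tails.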


General Steps of Hypothesis Testing


After getting familiar with the above concepts, let us go through the general steps of hypothesis
testing.

1. State the question and establish a null hypothesis $H_0$.

2. Set the statistical significance level $\alpha$ and a statistical power ($1 - \beta$).

3. Obtain samples through experiments. The number of samples needed will depend on the
statistical power, and the expected effect size.

4. Calculate the test statistic and the p-value.

5. Make the decision to keep or reject the null hypothesis based on the p-value and the statistical
significance level $\alpha$.


To conduct a hypothesis test, we start by defining a null hypothesis and a level of risk that we
are willing to take. Then we calculate the test statistic of the sample, taking an extreme value of
the test statistic as evidence against the null hypothesis. If the test statistic falls within the rejection
region, we may reject the null hypothesis in favor of the alternative.
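
The steps above can be sketched as a simulation. We assume a two-sided z-test of $H_0$: mean $= 0$ for unit-variance Gaussian data with $\alpha = 0.05$, and repeat the experiment many times to estimate how often the test rejects. The helper name `rejection_rate`, the sample size, the number of trials, and the seed are our own choices, and we use plain NumPy here.

```python
import numpy as np


def rejection_rate(true_mean, n, n_trials=20000, seed=0):
    """Fraction of simulated experiments in which a two-sided z-test of
    H_0: mean = 0 (known unit variance, alpha = 0.05) rejects H_0."""
    rng = np.random.default_rng(seed)
    # Step 3: obtain n samples in each of n_trials independent experiments.
    samples = rng.normal(loc=true_mean, scale=1.0, size=(n_trials, n))
    # Step 4: the standardized sample mean is N(0, 1) under H_0.
    t = samples.mean(axis=1) * np.sqrt(n)
    # Step 5: reject when the statistic falls in the rejection region |T| > 1.96.
    return float(np.mean(np.abs(t) > 1.96))


print("type I error rate (H_0 true):", rejection_rate(true_mean=0.0, n=8))
print("power (true mean is 1):      ", rejection_rate(true_mean=1.0, n=8))
```

When the null hypothesis is true, the rejection rate should be close to the significance level of 5%. When the true mean is 1, the rejection rate estimates the statistical power, which should be close to 80% for 8 samples.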


Hypothesis testing is applicable in a variety of scenarios such as clinical trials and A/B testing.


18.10.3 Constructing Confidence Intervals


When estimating the value of a parameter $\theta$, point estimators like $\hat{\theta}$ are of limited utility since they
contain no notion of uncertainty. Rather, it would be far better if we could produce an interval
that would contain the true parameter $\theta$ with high probability. If you were interested in such
ideas a century ago, then you would have been excited to read "Outline of a Theory of Statistical
Estimation Based on the Classical Theory of Probability" by Jerzy Neyman (Neyman, 1937), who
first introduced the concept of confidence interval in 1937.


To be useful, a confidence interval should be as small as possible for a given degree of certainty. 
Let us see how to derive it. 


Definition 


Mathematically, a confidence interval for the true parameter $\theta$ is an interval $C_n$ that is computed from
the sample data such that


$P_\theta(C_n \ni \theta) \geq 1 - \alpha, \forall \theta.$ (18.10.8)


Here $\alpha \in (0, 1)$, and $1 - \alpha$ is called the confidence level or coverage of the interval. This is the same
$\alpha$ as the significance level that we discussed above.


Note that (18.10.8) is about the variable $C_n$, not about the fixed $\theta$. To emphasize this, we write
$P_\theta(C_n \ni \theta)$ rather than $P_\theta(\theta \in C_n)$.


Interpretation 


It is very tempting to interpret a 95% confidence interval as an interval where you can be 95% sure
the true parameter lies. Sadly, this is not true. The true parameter is fixed, and it is the
interval that is random. Thus a better interpretation would be to say that if you generated a large
number of confidence intervals by this procedure, 95% of the generated intervals would contain
the true parameter.


This may seem pedantic, but it can have real implications for the interpretation of the results. 
In particular, we may satisfy (18.10.8) by constructing intervals that we are almost certain do not 
contain the true value, as long as we only do so rarely enough. We close this section by providing 
three tempting but false statements. An in-depth discussion of these points can be found in (Morey 
et al., 2016). 


• Fallacy 1. Narrow confidence intervals mean we can estimate the parameter precisely.

• Fallacy 2. The values inside the confidence interval are more likely to be the true value than
those outside the interval.

• Fallacy 3. The probability that a particular observed 95% confidence interval contains the
true value is 95%.


Suffice it to say, confidence intervals are subtle objects. However, if you keep the interpretation
clear, they can be powerful tools.


A Gaussian Example


Let us discuss the most classical example, the confidence interval for the mean of a Gaussian of
unknown mean and variance. Suppose we collect $n$ samples $\{x_i\}_{i=1}^n$ from our Gaussian $\mathcal{N}(\mu, \sigma^2)$.
We can compute estimators for the mean and variance by taking


$\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^n x_i \quad \text{and} \quad \hat{\sigma}_n^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \hat{\mu}_n)^2.$ (18.10.9)


If we now consider the random variable


$T = \frac{\hat{\mu}_n - \mu}{\hat{\sigma}_n / \sqrt{n}},$ (18.10.10)


we obtain a random variable following a well-known distribution called the Student’s t-distribution 
on n — 1 degrees of freedom. 


This distribution is very well studied, and it is known, for instance, that as $n \to \infty$, it is
approximately a standard Gaussian, and thus by looking up values of the Gaussian c.d.f. in a table, we
may conclude that the value of $T$ is in the interval $[-1.96, 1.96]$ at least 95% of the time. For finite
values of $n$, the interval needs to be somewhat larger, but its endpoints are well known and
precomputed in tables.


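Thus, we may conclude that for large $n$,


$P\left(\frac{\hat{\mu}_n - \mu}{\hat{\sigma}_n / \sqrt{n}} \in [-1.96, 1.96]\right) \geq 0.95.$ (18.10.11)


Rearranging this by multiplying both sides by $\hat{\sigma}_n / \sqrt{n}$ and then adding $\hat{\mu}_n$, we obtain


$P\left(\mu \in \left[\hat{\mu}_n - 1.96 \frac{\hat{\sigma}_n}{\sqrt{n}}, \hat{\mu}_n + 1.96 \frac{\hat{\sigma}_n}{\sqrt{n}}\right]\right) \geq 0.95.$ (18.10.12)


Thus we know that we have found our 95% confidence interval:


$\left[\hat{\mu}_n - 1.96 \frac{\hat{\sigma}_n}{\sqrt{n}}, \hat{\mu}_n + 1.96 \frac{\hat{\sigma}_n}{\sqrt{n}}\right].$ (18.10.13)
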

It is safe to say that (18.10.13) is one of the most used formulas in statistics. Let us close our
discussion of statistics by implementing it. For simplicity, we assume we are in the asymptotic regime.
For small values of N, t_star should instead be set to the correct value, obtained either
programmatically or from a t-table.


# Number of samples
N = 1000

# Sample dataset
samples = np.random.normal(loc=0, scale=1, size=(N,))

# Lookup Student's t-distribution c.d.f.
t_star = 1.96

# Construct interval
mu_hat = np.mean(samples)
sigma_hat = samples.std(ddof=1)
(mu_hat - t_star*sigma_hat/np.sqrt(N), mu_hat + t_star*sigma_hat/np.sqrt(N))


(array(-0.07853346), array(0.04412608))
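
The interval above comes from a single dataset. To connect it with the interpretation of confidence intervals, the following sketch repeats the construction many times and counts how often the interval contains the true mean. It uses plain NumPy rather than the MXNet `np` module; the helper name `coverage`, the number of trials, and the seed are our own choices. The value 2.776 is the t-table entry for 4 degrees of freedom at the 95% level.

```python
import numpy as np


def coverage(n, t_star, true_mean=0.0, n_trials=5000, seed=0):
    """Fraction of simulated intervals mu_hat +/- t_star * sigma_hat / sqrt(n)
    that contain the true mean of a unit-variance Gaussian."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(loc=true_mean, scale=1.0, size=(n_trials, n))
    mu_hat = samples.mean(axis=1)
    sigma_hat = samples.std(axis=1, ddof=1)
    half_width = t_star * sigma_hat / np.sqrt(n)
    inside = np.abs(mu_hat - true_mean) <= half_width
    return float(inside.mean())


print("N = 1000, t_star = 1.96: ", coverage(1000, 1.96))
print("N = 5,    t_star = 1.96: ", coverage(5, 1.96))
print("N = 5,    t_star = 2.776:", coverage(5, 2.776))
```

With $N = 1000$, roughly 95% of the intervals should contain the true mean. With $N = 5$, the asymptotic value 1.96 covers noticeably less often, while the t-table value restores approximately 95% coverage, which is why small values of N need the correct t_star.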


Summary 


• Statistics focuses on inference problems, whereas deep learning emphasizes making accurate
predictions without explicit programming and understanding.

• There are three common statistical inference methods: evaluating and comparing estimators,
conducting hypothesis tests, and constructing confidence intervals.

• The three most common metrics for evaluating estimators are statistical bias, standard deviation,
and mean square error.

• A confidence interval is an estimated range of a true population parameter that we can
construct from the given samples.

• Hypothesis testing is a way of evaluating some evidence against the default statement about
a population.


Exercises 


1. Let $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} \mathrm{Unif}(0, \theta)$, where "iid" stands for independent and identically distributed.
Consider the following estimators of $\theta$:


$\hat{\theta} = \max\{X_1, X_2, \ldots, X_n\};$ (18.10.14)

$\tilde{\theta} = 2 \bar{X}_n = \frac{2}{n} \sum_{i=1}^n X_i.$ (18.10.15)


• Find the statistical bias, standard deviation, and mean square error of $\hat{\theta}$.
• Find the statistical bias, standard deviation, and mean square error of $\tilde{\theta}$.
• Which estimator is better?


2. For our chemist example in the introduction, can you derive the 5 steps to conduct a two-sided
hypothesis test, given the statistical significance level $\alpha = 0.05$ and the statistical power
$1 - \beta = 0.8$?


3. Run the confidence interval code with $N = 2$ and $\alpha = 0.5$ for 100 independently generated
datasets, and plot the resulting intervals (in this case t_star = 1.0). You will see several very
short intervals which are very far from containing the true mean 0. Does this contradict the
interpretation of the confidence interval? Do you feel comfortable using short intervals to
indicate high precision estimates?


Discussions: https://discuss.d2l.ai/t/419


18.11 Information Theory


The universe is overflowing with information. Information provides a common language across
disciplinary rifts: from Shakespeare's sonnets to researchers' papers on Cornell's arXiv, from Van
Gogh's painting Starry Night to Beethoven's Symphony No. 5, from the first programming
language Plankalkül to the state-of-the-art machine learning algorithms. Everything must follow
the rules of information theory, no matter the format. With information theory, we can measure
and compare how much information is present in different signals. In this section, we will
investigate the fundamental concepts of information theory and applications of information theory in
machine learning.


Before we get started, let us outline the relationship between machine learning and information
theory. Machine learning aims to extract interesting signals from data and make critical
predictions. On the other hand, information theory studies encoding, decoding, transmitting, and
manipulating information. As a result, information theory provides a fundamental language for
discussing the information processing in machine learning systems. For example, many machine
learning applications use the cross entropy loss as described in Section 3.4. This loss can be
directly derived from information theoretic considerations.


18.11.1 Information 


Let us start with the “soul” of information theory: information. Information can be encoded in 
anything with a particular sequence of one or more encoding formats. Suppose that we task our- 
selves with trying to define a notion of information. What could be our starting point? 


Consider the following thought experiment. We have a friend with a deck of cards. They will 
shuffle the deck, flip over some cards, and tell us statements about the cards. We will try to assess 
the information content of each statement. 


First, they flip over a card and tell us, “I see a card.” This provides us with no information at all. 
We were already certain that this was the case so we hope the information should be zero. 


Next, they flip over a card and say, “I see a heart.” This provides us some information, but in reality 
there are only 4 different suits that were possible, each equally likely, so we are not surprised by 
this outcome. We hope that whatever the measure of information, this event should have low 
information content. 


Next, they flip over a card and say, “This is the 3 of spades.” This is more information. Indeed there 
were 52 equally likely possible outcomes, and our friend told us which one it was. This should be 
a medium amount of information. 


Let us take this to the logical extreme. Suppose that finally they flip over every card from the deck 
and read off the entire sequence of the shuffled deck. There are 52! different orders to the deck, 
again all equally likely, so we need a lot of information to know which one it is. 


Any notion of information we develop must conform to this intuition. Indeed, in the next sec- 
tions we will learn how to compute that these events have 0 bits, 2 bits, 5.7 bits, and 225.6 bits of 
information respectively. 


If we read through these thought experiments, we see a natural idea. As a starting point, rather
than caring about the knowledge, we may build off the idea that information represents the degree
of surprise or the abstract possibility of the event. For example, if we want to describe an unusual
event, we need a lot of information. For a common event, we may not need much information.


In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948)
establishing the theory of information. In his article, Shannon introduced the concept of
information entropy for the first time. We will begin our journey here.


Self-information 


Since information embodies the abstract possibility of an event, how do we map the possibility 
to the number of bits? Shannon introduced the terminology bit as the unit of information, which 
was originally created by John Tukey. So what is a “bit” and why do we use it to measure infor- 
mation? Historically, an antique transmitter can only send or receive two types of code: 0 and 
1. Indeed, binary encoding is still in common use on all modern digital computers. In this way, 
any information is encoded by a series of 0 and 1. And hence, a series of binary digits of length n 
contains n bits of information. 


Now, suppose that for any series of codes, each 0 or 1 occurs with a probability of $\frac{1}{2}$. Hence, an
event $X$ with a series of codes of length $n$ occurs with a probability of $\frac{1}{2^n}$. At the same time, as
we mentioned before, this series contains $n$ bits of information. So, can we generalize to a mathematical
function which can transfer the probability $p$ to the number of bits? Shannon gave the answer by
defining self-information


$I(X) = -\log_2(p),$ (18.11.1)


as the bits of information we have received for this event $X$. Note that we will always use base-2
logarithms in this section. For the sake of simplicity, the rest of this section will omit the subscript
2 in the logarithm notation, i.e., $\log(\cdot)$ always refers to $\log_2(\cdot)$. For example, the code "0010" has a
self-information


$I(\text{"0010"}) = -\log(p(\text{"0010"})) = -\log\left(\frac{1}{2^4}\right) = 4 \text{ bits}.$ (18.11.2)


We can calculate self information as shown below. Before that, let us first import all the necessary 
packages in this section. 


from mxnet import np
from mxnet.metric import NegativeLogLikelihood
from mxnet.ndarray import nansum
import random


def self_information(p):
    return -np.log2(p)

self_information(1 / 64)


6.0
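
Returning to the card-deck thought experiment, we can check the quoted information contents with the same formula. For one of $N$ equally likely outcomes the probability is $1/N$, so the self-information is $-\log_2(1/N) = \log_2 N$. The sketch below uses only the standard library, and the helper name `bits_for_uniform` is our own.

```python
import math


def bits_for_uniform(num_outcomes):
    """Self-information, in bits, of one of num_outcomes equally likely outcomes.

    Equals -log2(1 / num_outcomes) = log2(num_outcomes).
    """
    return math.log2(num_outcomes)


print(bits_for_uniform(1))                   # "I see a card"
print(bits_for_uniform(4))                   # "I see a heart"
print(bits_for_uniform(52))                  # "This is the 3 of spades"
print(bits_for_uniform(math.factorial(52)))  # the full order of the deck
```

The four values come out as 0, 2, about 5.7, and about 225.6 bits, matching the numbers stated for the thought experiment.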


18.11.2 Entropy


As self-information only measures the information of a single discrete event, we need a more 
generalized measure for any random variable of either discrete or continuous distribution. 


Motivating Entropy 


Let us try to get specific about what we want. This will be an informal statement of what are known 
as the axioms of Shannon entropy. It will turn out that the following collection of common-sense 
statements force us to a unique definition of information. A formal version of these axioms, along 
with several others may be found in (Csiszar, 2008). 


1. The information we gain by observing a random variable does not depend on what we call 
the elements, or the presence of additional elements which have probability zero. 


2. The information we gain by observing two random variables is no more than the sum of the
information we gain by observing them separately. If they are independent, then it is exactly
the sum.


3. The information gained when observing (nearly) certain events is (nearly) zero. 


While proving this fact is beyond the scope of our text, it is important to know that this uniquely 
determines the form that entropy must take. The only ambiguity that these allow is in the choice 
of fundamental units, which is most often normalized by making the choice we saw before that 
the information provided by a single fair coin flip is one bit. 


Definition 


For any random variable X that follows a probability distribution P with a probability density 
function (p.d.f.) or a probability mass function (p.m.f.) p(x), we measure the expected amount of 
information through entropy (or Shannon entropy) 


$H(X) = -E_{x \sim P}[\log p(x)].$ (18.11.3)


To be specific, if $X$ is discrete,


$H(X) = -\sum_i p_i \log p_i, \text{ where } p_i = P(X_i).$ (18.11.4)


Otherwise, if $X$ is continuous, we also refer to entropy as differential entropy


$H(X) = -\int_x p(x) \log p(x) \, dx.$ (18.11.5)


We can define entropy as below. 


def entropy(p):
    entropy = - p * np.log2(p)
    # Operator nansum will sum up the non-nan number
    out = nansum(entropy.as_nd_ndarray())
    return out

entropy(np.array([0.1, 0.5, 0.1, 0.3]))


[1.6854753]
<NDArray 1 @cpu(0)>


Interpretations 


You may be curious: in the entropy definition (18.11.3), why do we use an expectation of a negative 
logarithm? Here are some intuitions. 


First, why do we use a logarithm function $\log$? Suppose that $p(x) = f_1(x) f_2(x) \ldots f_n(x)$, where
each component function $f_i(x)$ is independent of the others. This means that each $f_i(x)$
contributes independently to the total information obtained from $p(x)$. As discussed above, we want
the entropy formula to be additive over independent random variables. Luckily, $\log$ can naturally
turn a product of probability distributions into a summation of the individual terms.


Next, why do we use a negative log? Intuitively, more frequent events should contain less infor- 
mation than less common events, since we often gain more information from an unusual case 
than from an ordinary one. However, log is monotonically increasing with the probabilities, and 
indeed negative for all values in [0, 1]. We need to construct a monotonically decreasing relation- 
ship between the probability of events and their entropy, which will ideally be always positive (for 
nothing we observe should force us to forget what we have known). Hence, we add a negative sign 
in front of log function. 


Last, where does the expectation function come from? Consider a random variable $X$. We can
interpret the self-information ($-\log(p)$) as the amount of surprise we have at seeing a particular
outcome. Indeed, as the probability approaches zero, the surprise becomes infinite. Similarly,
we can interpret the entropy as the average amount of surprise from observing $X$. For example,
imagine that a slot machine system emits statistically independent symbols $s_1, \ldots, s_k$ with
probabilities $p_1, \ldots, p_k$ respectively. Then the entropy of this system equals the average
self-information from observing each output, i.e.,


$H(S) = \sum_i p_i \cdot I(s_i) = -\sum_i p_i \cdot \log p_i.$ (18.11.6)


Properties of Entropy


By the above examples and interpretations, we can derive the following properties of entropy
(18.11.3). Here, we refer to $X$ as an event and $P$ as the probability distribution of $X$.

• Entropy is non-negative for every discrete $X$, i.e., $H(X) \geq 0$ (differential entropy, in contrast,
can be negative).

• If $X \sim P$ with a p.d.f. or a p.m.f. $p(x)$, and we try to estimate $P$ by a new probability
distribution $Q$ with a p.d.f. or a p.m.f. $q(x)$, then


$H(X) = -E_{x \sim P}[\log p(x)] \leq -E_{x \sim P}[\log q(x)], \text{ with equality if and only if } P = Q.$ (18.11.7)


Alternatively, $H(X)$ gives a lower bound of the average number of bits needed to encode
symbols drawn from $P$.

• If $X \sim P$, then $x$ conveys the maximum amount of information if it spreads evenly among
all possible outcomes. Specifically, if the probability distribution $P$ is discrete with $k$ classes
$\{p_1, \ldots, p_k\}$, then


$H(X) \leq \log(k), \text{ with equality if and only if } p_i = \frac{1}{k}, \forall i.$ (18.11.8)


If $P$ is a continuous random variable, then the story becomes much more complicated.
However, if we additionally impose that $P$ is supported on a finite interval (with all values
between 0 and 1), then $P$ has the highest entropy if it is the uniform distribution on that
interval.
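
The discrete properties can be checked numerically. The following sketch uses plain NumPy; the helper names `entropy_bits` and `cross_entropy_bits` and the example distributions are our own, with `p` taken from the entropy example above and `q` chosen as the uniform distribution over $k = 4$ classes.

```python
import numpy as np


def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution p (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))


def cross_entropy_bits(p, q):
    """-E_{x ~ P}[log2 q(x)], assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))


p = np.array([0.1, 0.5, 0.1, 0.3])
q = np.array([0.25, 0.25, 0.25, 0.25])  # uniform over k = 4 classes

print("H(P)              =", entropy_bits(p))           # non-negative
print("-E_P[log q]       =", cross_entropy_bits(p, q))  # not smaller than H(P)
print("H(uniform), log k =", entropy_bits(q), np.log2(4))
```

The entropy of `p` is about 1.685 bits, below the 2 bits we pay on average when encoding with the mismatched distribution `q`, and the uniform distribution attains the upper bound $\log(k) = 2$ bits.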


18.11.3 Mutual Information 


Previously we defined the entropy of a single random variable $X$. How about the entropy of a pair
of random variables $(X, Y)$? We can think of these techniques as trying to answer the following type
of question, "What information is contained in $X$ and $Y$ together compared to each separately? Is
there redundant information, or is it all unique?"


For the following discussion, we always use $(X, Y)$ as a pair of random variables that follows a joint
probability distribution $P$ with a p.d.f. or a p.m.f. $p_{X,Y}(x, y)$, while $X$ and $Y$ follow probability
distributions $p_X(x)$ and $p_Y(y)$, respectively.


Joint Entropy 


Similar to the entropy of a single random variable (18.11.3), we define the joint entropy $H(X, Y)$ of a
pair of random variables $(X, Y)$ as


$H(X, Y) = -E_{(x, y) \sim P}[\log p_{X,Y}(x, y)].$ (18.11.9)


Precisely, on the one hand, if $(X, Y)$ is a pair of discrete random variables, then


$H(X, Y) = -\sum_x \sum_y p_{X,Y}(x, y) \log p_{X,Y}(x, y).$ (18.11.10)


On the other hand, if $(X, Y)$ is a pair of continuous random variables, then we define the differential
joint entropy as


$H(X, Y) = -\int_{x, y} p_{X,Y}(x, y) \log p_{X,Y}(x, y) \, dx \, dy.$ (18.11.11)


We can think of (18.11.9) as telling us the total randomness in the pair of random variables. As a
pair of extremes, if $X = Y$ are two identical random variables, then the information in the pair
is exactly the information in one and we have $H(X, Y) = H(X) = H(Y)$. On the other extreme,
if $X$ and $Y$ are independent then $H(X, Y) = H(X) + H(Y)$. Indeed we will always have that
the information contained in a pair of random variables is no smaller than the entropy of either
random variable and no more than the sum of both.


$H(X), H(Y) \leq H(X, Y) \leq H(X) + H(Y).$ (18.11.12)


Let us implement joint entropy from scratch.


def joint_entropy(p_xy):
    joint_ent = -p_xy * np.log2(p_xy)
    # Operator nansum will sum up the non-nan number
    out = nansum(joint_ent.as_nd_ndarray())
    return out

joint_entropy(np.array([[0.1, 0.5], [0.1, 0.3]]))


[1.6854753]
<NDArray 1 @cpu(0)>


Notice that this is the same code as before, but now we interpret it differently as working on the 
joint distribution of the two random variables. 
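
The bounds in (18.11.12) and the two extreme cases can be checked numerically. The sketch below uses plain NumPy and assumes a joint p.m.f. stored as a matrix whose rows index $x$ and whose columns index $y$; the helper names and the three example tables are our own.

```python
import numpy as np


def entropy_bits(p):
    """Shannon entropy in bits of a p.m.f. given as an array (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def entropies(p_xy):
    """Return (H(X), H(Y), H(X, Y)) for a joint p.m.f. with rows x, columns y."""
    p_xy = np.asarray(p_xy, dtype=float)
    h_x = entropy_bits(p_xy.sum(axis=1))  # marginal of X: sum over y
    h_y = entropy_bits(p_xy.sum(axis=0))  # marginal of Y: sum over x
    return h_x, h_y, entropy_bits(p_xy)


identical = np.diag([0.5, 0.5])                 # X = Y: one shared fair coin
independent = np.outer([0.6, 0.4], [0.2, 0.8])  # p(x, y) = p(x) p(y)
generic = np.array([[0.1, 0.5], [0.1, 0.3]])

for name, p_xy in [("identical", identical),
                   ("independent", independent),
                   ("generic", generic)]:
    h_x, h_y, h_xy = entropies(p_xy)
    print(f"{name}: H(X)={h_x:.4f}, H(Y)={h_y:.4f}, H(X,Y)={h_xy:.4f}")
```

In the identical case the joint entropy equals the entropy of either variable, in the independent case it equals their sum, and the generic table falls between the two bounds.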


Conditional Entropy 


The joint entropy defined above measures the amount of information contained in a pair of random
variables. This is useful, but oftentimes it is not what we care about. Consider the setting of machine
learning. Let us take $X$ to be the random variable (or vector of random variables) that describes
the pixel values of an image, and $Y$ to be the random variable which is the class label. $X$ should
contain substantial information, since a natural image is a complex thing. However, the information
contained in $Y$ once the image has been shown should be low. Indeed, the image of a digit should
already contain the information about what digit it is unless the digit is illegible. Thus, to continue
to extend our vocabulary of information theory, we need to be able to reason about the
information content in a random variable conditional on another.


In probability theory, we saw the definition of the conditional probability to measure the
relationship between variables. We now want to analogously define the conditional entropy $H(Y \mid X)$.
We can write this as


$H(Y \mid X) = -E_{(x, y) \sim P}[\log p(y \mid x)],$ (18.11.13)


where $p(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}$ is the conditional probability. Specifically, if $(X, Y)$ is a pair of discrete
random variables, then


$H(Y \mid X) = -\sum_x \sum_y p(x, y) \log p(y \mid x).$ (18.11.14)


If $(X, Y)$ is a pair of continuous random variables, then the differential conditional entropy is
similarly defined as


$H(Y \mid X) = -\int_x \int_y p(x, y) \log p(y \mid x) \, dx \, dy.$ (18.11.15)


It is now natural to ask, how does the conditional entropy H(Y | X) relate to the entropy H(X) and 
the joint entropy H(X, Y)? Using the definitions above, we can express this cleanly: 


H(Y | X) = H(X,Y) — H(X). (18.11.16) 


This has an intuitive interpretation: the information in Y given X (H(Y | X)) is the same as the 
information in both X and Y together (H(X, Y )) minus the information already contained in X. 
This gives us the information in Y which is not also represented in X. 


Now, let us implement conditional entropy (18.11.13) from scratch. 


def conditional_entropy(p_xy, p_x):
    p_y_given_x = p_xy/p_x
    cond_ent = -p_xy * np.log2(p_y_given_x)
    # Operator nansum will sum up the non-nan number
    out = nansum(cond_ent.as_nd_ndarray())
    return out

conditional_entropy(np.array([[0.1, 0.5], [0.2, 0.3]]), np.array([0.2, 0.8]))


[0.8635472]
<NDArray 1 @cpu(0)>
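
The identity (18.11.16) can also be verified numerically. The sketch below uses plain NumPy and, unlike the function above, derives the marginal $p_X$ from the joint table itself so that the two are guaranteed to be consistent. It assumes rows index $x$, columns index $y$, and all entries are strictly positive; the helper names are our own.

```python
import numpy as np


def entropy_bits(p):
    """Shannon entropy in bits of a p.m.f. given as an array (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def conditional_entropy_bits(p_xy):
    """H(Y | X) in bits for a joint p.m.f. with rows x and columns y.

    Assumes every entry of p_xy is strictly positive.
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)  # column vector of marginals of X
    p_y_given_x = p_xy / p_x               # each row is p(y | x)
    return float(-np.sum(p_xy * np.log2(p_y_given_x)))


p_xy = np.array([[0.1, 0.5], [0.1, 0.3]])
h_y_given_x = conditional_entropy_bits(p_xy)
h_xy = entropy_bits(p_xy)
h_x = entropy_bits(p_xy.sum(axis=1))

print("H(Y | X)       =", h_y_given_x)
print("H(X,Y) - H(X)  =", h_xy - h_x)
```

Both printed values agree (about 0.7145 bits for this table), as (18.11.16) predicts.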


Mutual Information 


Given the previous setting of random variables $(X, Y)$, you may wonder: "Now that we know how
much information is contained in $Y$ but not in $X$, can we similarly ask how much information is
shared between $X$ and $Y$?" The answer will be the mutual information of $(X, Y)$, which we will
write as $I(X, Y)$.

Rather than diving straight into the formal definition, let us practice our intuition by first trying
to derive an expression for the mutual information entirely based on terms we have constructed
before. We wish to find the information shared between two random variables. One way we could
try to do this is to start with all the information contained in both $X$ and $Y$ together, and then
we take off the parts that are not shared. The information contained in both $X$ and $Y$ together is
written as $H(X, Y)$. We want to subtract from this the information contained in $X$ but not in $Y$,
and the information contained in $Y$ but not in $X$. As we saw in the previous section, this is given
by $H(X \mid Y)$ and $H(Y \mid X)$ respectively. Thus, we have that the mutual information should be


$I(X, Y) = H(X, Y) - H(Y \mid X) - H(X \mid Y).$ (18.11.17)


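Indeed, this is a valid definition for the mutual information. If we expand out the definitions of
these terms and combine them, a little algebra shows that this is the same as


$I(X, Y) = \sum_x \sum_y p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_X(x) p_Y(y)},$ (18.11.18)


with integrals in place of the sums for continuous random variables. In other words, it is the
expectation of the logarithmic term under the joint distribution.


We can summarize all of these relationships in Fig. 18.11.1. It is an excellent test of intuition
to see why the following statements are all also equivalent to $I(X, Y)$.


• $H(X) - H(X \mid Y)$
• $H(Y) - H(Y \mid X)$
• $H(X) + H(Y) - H(X, Y)$


[Figure: diagram relating the entropies $H(X)$ and $H(Y)$, the conditional entropies $H(X \mid Y)$ and
$H(Y \mid X)$, the mutual information $I(X, Y)$, and the joint entropy $H(X, Y)$.]

Fig. 18.11.1: Mutual information's relationship with joint entropy and conditional entropy.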


In many ways we can think of the mutual information (18.11.18) as a principled extension of the
correlation coefficient we saw in Section 18.6. This allows us to ask not only for linear relationships
between variables, but for the maximum information shared between the two random variables
of any kind.


Now, let us implement mutual information from scratch. 


def mutual_information(p_xy, p_x, p_y):
    p = p_xy / (p_x * p_y)
    mutual = p_xy * np.log2(p)
    # Operator nansum will sum up the non-nan number
    out = nansum(mutual.as_nd_ndarray())
    return out

mutual_information(np.array([[0.1, 0.5], [0.1, 0.3]]),
                   np.array([0.2, 0.8]), np.array([[0.75, 0.25]]))


[0.71946025]
<NDArray 1 @cpu(0)>
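
As a numerical test of intuition, the sketch below checks that the direct formula (18.11.18), the definition (18.11.17), and the three equivalent expressions all agree. It uses plain NumPy and computes the marginals from the joint table (rows index $x$, columns index $y$, all entries assumed positive), so the inputs are consistent by construction; conditional entropies are obtained through (18.11.16). The helper names are our own.

```python
import numpy as np


def entropy_bits(p):
    """Shannon entropy in bits of a p.m.f. given as an array (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def mutual_information_bits(p_xy):
    """I(X, Y) in bits via (18.11.18); assumes all entries of p_xy are positive."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)  # shape (n_x, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # shape (1, n_y)
    return float(np.sum(p_xy * np.log2(p_xy / (p_x * p_y))))


p_xy = np.array([[0.1, 0.5], [0.1, 0.3]])
h_x = entropy_bits(p_xy.sum(axis=1))
h_y = entropy_bits(p_xy.sum(axis=0))
h_xy = entropy_bits(p_xy)
h_x_given_y = h_xy - h_y  # (18.11.16) with the roles of X and Y swapped
h_y_given_x = h_xy - h_x  # (18.11.16)

candidates = {
    "direct formula (18.11.18)": mutual_information_bits(p_xy),
    "H(X,Y) - H(Y|X) - H(X|Y)": h_xy - h_y_given_x - h_x_given_y,
    "H(X) - H(X|Y)": h_x - h_x_given_y,
    "H(Y) - H(Y|X)": h_y - h_y_given_x,
    "H(X) + H(Y) - H(X,Y)": h_x + h_y - h_xy,
}
for name, value in candidates.items():
    print(f"{name}: {value:.6f}")
```

All five expressions give the same small positive value (about 0.0074 bits) for this table, since its two variables are only weakly dependent.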


Properties of Mutual Information 


Rather than memorizing the definition of mutual information (18.11.18), you only need to keep in 
mind its notable properties: 


• Mutual information is symmetric, i.e., $I(X, Y) = I(Y, X)$.

• Mutual information is non-negative, i.e., $I(X, Y) \geq 0$.

• $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent. For example, if $X$ and $Y$ are
independent, then knowing $Y$ does not give any information about $X$ and vice versa, so their mutual
information is zero.

• Alternatively, if $X$ is an invertible function of $Y$, then $Y$ and $X$ share all information and


$I(X, Y) = H(Y) = H(X).$ (18.11.19)


Pointwise Mutual Information 


When we worked with entropy at the beginning of this chapter, we were able to provide an
interpretation of $-\log(p_X(x))$ as how surprised we were with the particular outcome. We may give a
similar interpretation to the logarithmic term in the mutual information, which is often referred
to as the pointwise mutual information:


$\mathrm{pmi}(x, y) = \log \frac{p_{X,Y}(x, y)}{p_X(x) p_Y(y)}.$ (18.11.20)


We can think of (18.11.20) as measuring how much more or less likely the specific combination
of outcomes $x$ and $y$ is compared to what we would expect for independent random outcomes.
If it is large and positive, then these two specific outcomes occur much more frequently than they
would by random chance (note: the denominator $p_X(x) p_Y(y)$ is the probability
of the two outcomes if they were independent), whereas if it is large and negative, the two
outcomes happen far less often than we would expect by random chance.


This allows us to interpret the mutual information (18.11.18) as the average amount that we were
surprised to see two outcomes occurring together compared to what we would expect if they were
independent.
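
To illustrate, the following sketch computes the pointwise mutual information of every cell of a small joint table and then averages it under the joint distribution to recover the mutual information. It uses plain NumPy; the function name `pmi_bits` and the example table, in which the two variables tend to agree, are our own assumptions.

```python
import numpy as np


def pmi_bits(p_xy):
    """Pointwise mutual information (18.11.20), in bits, for every cell (x, y).

    Rows of p_xy index x and columns index y; all entries are assumed positive.
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return np.log2(p_xy / (p_x * p_y))


# Hypothetical joint table: X and Y agree 80% of the time.
p_xy = np.array([[0.4, 0.1], [0.1, 0.4]])
pmi = pmi_bits(p_xy)
mutual_info = float(np.sum(p_xy * pmi))  # average pmi under the joint

print(pmi)
print(mutual_info)
```

The diagonal cells, where the outcomes agree, have positive pmi, the off-diagonal cells have negative pmi, and their average under the joint distribution is the mutual information (about 0.278 bits here).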


Applications of Mutual Information


Mutual information may be a little abstract in its pure definition, so how is it related to machine
learning? In natural language processing, one of the most difficult problems is ambiguity
resolution, or the issue of the meaning of a word being unclear from context. For example, recently
a headline in the news reported that "Amazon is on fire". You may wonder whether the company
Amazon has a building on fire, or the Amazon rain forest is on fire.


In this case, mutual information can help us resolve this ambiguity. We first find the group of 
words that each has a relatively large mutual information with the company Amazon, such as 
e-commerce, technology, and online. Second, we find another group of words that each has a 
relatively large mutual information with the Amazon rain forest, such as rain, forest, and tropical. 
When we need to disambiguate “Amazon”, we can compare which group occurs more often in
the context of the word Amazon. In this case the article would go on to describe the forest,
making the context clear.
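The procedure above can be sketched on a toy scale. The snippet below is only an illustration under strong assumptions: the four labelled snippets are invented, and it uses pointwise mutual information computed from snippet-level co-occurrence counts as a simple stand-in for the word-level mutual information described above. The function names `pmi` and `disambiguate` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, hand-written snippets labelled with the intended sense of "Amazon".
corpus = [
    ("company", "amazon expands online e-commerce technology business"),
    ("company", "amazon online store sells technology"),
    ("forest", "amazon tropical rain forest wildlife"),
    ("forest", "fire spreads across amazon rain forest"),
]

n = len(corpus)
sense_counts = Counter(sense for sense, _ in corpus)
word_counts = Counter()
pair_counts = Counter()
for sense, text in corpus:
    for word in set(text.split()):  # count each word once per snippet
        word_counts[word] += 1
        pair_counts[(word, sense)] += 1

def pmi(word, sense):
    """log2 of p(word, sense) / (p(word) p(sense)); None if never seen together."""
    joint = pair_counts[(word, sense)] / n
    if joint == 0:
        return None
    return math.log2(joint / ((word_counts[word] / n) * (sense_counts[sense] / n)))

def disambiguate(context):
    """Pick the sense with the most positively associated words in the context."""
    scores = {}
    for sense in sense_counts:
        hits = [w for w in context.lower().split() if (pmi(w, sense) or 0) > 0]
        scores[sense] = len(hits)
    return max(scores, key=scores.get)

print(disambiguate("the amazon is on fire and the rain forest is burning"))
print(disambiguate("amazon launches new online technology"))
```

Note that the word “amazon” itself appears with both senses equally often, so its pmi with either sense is zero and it does not influence the decision; only the surrounding words do.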


18.11.4 Kullback-Leibler Divergence 


As discussed in Section 2.3, we can use norms to measure the distance between two
points in space of any dimensionality. We would like to be able to do a similar task with probability
distributions. There are many ways to go about this, but information theory provides one of the
nicest. We now explore the Kullback-Leibler (KL) divergence, which provides a way to measure
whether two distributions are close together or not.


Definition 


Given a random variable X that follows the probability distribution P with a p.d.f. or a p.m.f. p(x),
suppose that we estimate P by another probability distribution Q with a p.d.f. or a p.m.f. q(x). Then the
Kullback-Leibler (KL) divergence (or relative entropy) between P and Q is


$$D_{\mathrm{KL}}(P\|Q) = E_{x \sim P} \left[ \log \frac{p(x)}{q(x)} \right].$$ (18.11.21)


As with the pointwise mutual information (18.11.20), we can again provide an interpretation of
the logarithmic term: $-\log \frac{q(x)}{p(x)} = -\log(q(x)) - (-\log(p(x)))$ will be large and positive if we see x
far more often under P than we would expect for Q, and large and negative if we see the outcome
far less than expected. In this way, we can interpret it as our relative surprise at observing the
outcome compared to how surprised we would be observing it from our reference distribution.


Let us implement the KL divergence from scratch.


def kl_divergence(p, q):
    # Elementwise p * log2(p / q); NaN entries are treated as zero by nansum
    kl = p * np.log2(p / q)
    out = nansum(kl.as_nd_ndarray())
    return out.abs().asscalar()
KL Divergence Properties


Let us take a look at some properties of the KL divergence (18.11.21). 


* KL divergence is non-symmetric, i.e., there are P, Q such that

$$D_{\mathrm{KL}}(P\|Q) \neq D_{\mathrm{KL}}(Q\|P).$$ (18.11.22)

* KL divergence is non-negative, i.e.,

$$D_{\mathrm{KL}}(P\|Q) \geq 0.$$ (18.11.23)

  Note that the equality holds only when P = Q.

* If there exists an x such that p(x) > 0 and q(x) = 0, then $D_{\mathrm{KL}}(P\|Q) = \infty$.

* There is a close relationship between KL divergence and mutual information. Besides the
  relationship shown in Fig. 18.11.1, I(X, Y) is also numerically equivalent to the following
  terms:

  1. $D_{\mathrm{KL}}(P(X, Y) \| P(X)P(Y))$;
  2. $E_Y \{D_{\mathrm{KL}}(P(X \mid Y) \| P(X))\}$;
  3. $E_X \{D_{\mathrm{KL}}(P(Y \mid X) \| P(Y))\}$.

  For the first term, we interpret mutual information as the KL divergence between P(X, Y)
  and the product of P(X) and P(Y), and thus it is a measure of how different the joint
  distribution is from the distribution if they were independent. For the second term, mutual
  information tells us the average reduction in uncertainty about X that results from learning
  the value of Y. The third term has the same interpretation with the roles of X and Y swapped.
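The properties above can be checked on small discrete distributions. The sketch below is written in plain Python and is independent of the MXNet implementation earlier in this section; the function name `kl_bits` and all probability values are assumptions made for illustration. The last part relies on the standard identity I(X, Y) = H(X) + H(Y) − H(X, Y) to cross-check the first of the three equivalent terms.

```python
import math

def kl_bits(p, q):
    """D_KL(P || Q) in bits for discrete distributions given as lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue         # terms with p(x) = 0 contribute nothing
        if qi == 0:
            return math.inf  # p(x) > 0 while q(x) = 0
        total += pi * math.log2(pi / qi)
    return total

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]
print(kl_bits(p, q), kl_bits(q, p))       # two different non-negative values
print(kl_bits(p, p))                      # 0.0 when the distributions coincide
print(kl_bits([0.5, 0.5], [1.0, 0.0]))    # inf

# Mutual information as a KL divergence: joint distribution vs. product of marginals.
joint = [0.3, 0.2, 0.1, 0.4]              # P(X, Y) flattened row by row
p_x, p_y = [0.5, 0.5], [0.4, 0.6]         # marginals of the table above
product = [a * b for a in p_x for b in p_y]
mi_from_kl = kl_bits(joint, product)
mi_from_entropies = entropy(p_x) + entropy(p_y) - entropy(joint)
print(mi_from_kl, mi_from_entropies)      # agree up to rounding
```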


Example 


Let us go through a toy example to see the non-symmetry explicitly. 


First, let us generate and sort three tensors of length 10,000: an objective tensor p which follows
a normal distribution N(0, 1), and two candidate tensors q1 and q2 which follow normal
distributions N(−1, 1) and N(1, 1) respectively.


random.seed(1)

nd_len = 10000
p = np.random.normal(loc=0, scale=1, size=(nd_len, ))
q1 = np.random.normal(loc=-1, scale=1, size=(nd_len, ))
q2 = np.random.normal(loc=1, scale=1, size=(nd_len, ))

p = np.array(sorted(p.asnumpy()))
q1 = np.array(sorted(q1.asnumpy()))
q2 = np.array(sorted(q2.asnumpy()))


Since q1 and q2 are symmetric with respect to the y-axis (i.e., x = 0), we expect similar values
for $D_{\mathrm{KL}}(p\|q_1)$ and $D_{\mathrm{KL}}(p\|q_2)$. As you can see below, the two values
differ by less than 3%.
kl_pq1 = kl_divergence(p, q1)
kl_pq2 = kl_divergence(p, q2)
similar_percentage = abs(kl_pq1 - kl_pq2) / ((kl_pq1 + kl_pq2) / 2) * 100

kl_pq1, kl_pq2, similar_percentage


(8470.638, 8664.999, 2.268504302642314) 


In contrast, you may find that $D_{\mathrm{KL}}(q_2\|p)$ and $D_{\mathrm{KL}}(p\|q_2)$ differ substantially,
by around 40% as shown below.


kl_q2p = kl_divergence(q2, p) 
differ_percentage = abs(kl_q2p - kl_pq2) / ((kl_q2p + kl_pq2) / 2) * 100 


kl_q2p, differ_percentage 


(13536.835, 43.88678828000115) 


18.11.5 Cross Entropy 


If you are curious about applications of information theory in deep learning, here is a quick ex- 
ample. We define the true distribution P with probability distribution p(x), and the estimated 
distribution Q with probability distribution q(x), and we will use them in the rest of this section. 


Say we need to solve a binary classification problem based on n given data examples $\{x_1, \ldots, x_n\}$.
Assume that we encode 1 and 0 as the positive and negative class label $y_i$ respectively, and our
neural network is parameterized by $\theta$. If we aim to find a best $\theta$ so that $\hat{y}_i = p_\theta(y_i \mid x_i)$, it is
natural to apply the maximum log-likelihood approach as was seen in Section 18.7. To be specific,
for true labels $y_i$ and predictions $\hat{y}_i = p_\theta(y_i \mid x_i)$, the probability to be classified as positive is
$\pi_i = p_\theta(y_i = 1 \mid x_i)$. Hence, the log-likelihood function would be


$$
\begin{aligned}
l(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^n \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} \\
&= \sum_{i=1}^n y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i).
\end{aligned}
$$ (18.11.24)


Maximizing the log-likelihood function $l(\theta)$ is identical to minimizing $-l(\theta)$, and hence we can
find the best $\theta$ from here. To generalize the above loss to any distributions, we also call $-l(\theta)$
the cross entropy loss $\mathrm{CE}(y, \hat{y})$, where y follows the true distribution P and $\hat{y}$ follows the estimated
distribution Q.
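As a quick sanity check of (18.11.24), the sketch below evaluates $-l(\theta)$ directly from the sum and compares it with the negative logarithm of the Bernoulli likelihood product. It is plain Python, the function name `binary_cross_entropy` is illustrative, and the labels and probabilities are made-up values.

```python
import math

def binary_cross_entropy(labels, probs):
    """-l(theta) from (18.11.24): labels y_i in {0, 1}, probs pi_i = P(y_i = 1 | x_i)."""
    return -sum(y * math.log(pi) + (1 - y) * math.log(1 - pi)
                for y, pi in zip(labels, probs))

labels = [1, 0, 1]        # hypothetical true labels
probs = [0.9, 0.2, 0.6]   # hypothetical predicted probabilities of the positive class

# Likelihood L(theta) as the product of Bernoulli probabilities.
likelihood = 1.0
for y, pi in zip(labels, probs):
    likelihood *= pi ** y * (1 - pi) ** (1 - y)

# The two printed values agree: the loss is the negative log-likelihood.
print(binary_cross_entropy(labels, probs), -math.log(likelihood))
```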


This was all derived by working from the maximum likelihood point of view. However, if we look
closely we can see that terms like $\log(\pi_i)$ have entered into our computation, which is a solid
indication that we can understand the expression from an information theoretic point of view.
Formal Definition


Like KL divergence, for a random variable X, we can also measure the divergence between the 
estimating distribution Q and the true distribution P via cross entropy, 


$$\mathrm{CE}(P, Q) = -E_{x \sim P} [\log(q(x))].$$ (18.11.25)


By using properties of entropy discussed above, we can also interpret it as the summation of the
entropy H(P) and the KL divergence between P and Q, i.e.,


$$\mathrm{CE}(P, Q) = H(P) + D_{\mathrm{KL}}(P\|Q).$$ (18.11.26)


We can implement the cross entropy loss as below.


def cross_entropy(y_hat, y):
    # Select the predicted probability of the true class for each example
    ce = -np.log(y_hat[range(len(y_hat)), y])
    return ce.mean()


Now define two tensors for the labels and predictions, and calculate the cross entropy loss of them. 


labels = np.array([0, 2])
preds = np.array([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]]) 


cross_entropy(preds, labels) 


array(0.94856) 


Properties


As alluded to at the beginning of this section, cross entropy (18.11.25) can be used to define a loss
function in the optimization problem. It turns out that the following are equivalent:

1. Maximizing predictive probability of Q for distribution P (i.e., $E_{x \sim P} [\log(q(x))]$);

2. Minimizing cross entropy CE(P, Q);

3. Minimizing the KL divergence $D_{\mathrm{KL}}(P\|Q)$.


The definition of cross entropy indirectly proves the equivalent relationship between objective 2 
and objective 3, as long as the entropy of true data H(P) is constant. 
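A small numerical check may help here. The sketch below, in plain Python with natural logarithms, evaluates both sides of (18.11.26) for a fixed P and a few candidate distributions Q, and confirms that the same candidate minimizes the cross entropy and the KL divergence. All distributions and the helper names (`cross_entropy_dist`, `kl_nats`) are assumptions for illustration.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy_dist(p, q):
    """CE(P, Q) = -E_{x ~ P}[log q(x)] for discrete distributions."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_nats(p, q):
    """D_KL(P || Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # hypothetical true distribution
candidates = {
    "q_a": [0.6, 0.3, 0.1],
    "q_b": [0.3, 0.3, 0.4],
    "q_c": [0.7, 0.2, 0.1],
}

for name, q in candidates.items():
    ce = cross_entropy_dist(p, q)
    decomposition = entropy(p) + kl_nats(p, q)
    print(name, ce, decomposition)  # the last two values agree, as in (18.11.26)

best_by_ce = min(candidates, key=lambda name: cross_entropy_dist(p, candidates[name]))
best_by_kl = min(candidates, key=lambda name: kl_nats(p, candidates[name]))
print(best_by_ce, best_by_kl)       # the same candidate minimizes both objectives
```

Because H(P) does not depend on Q, it shifts every cross entropy value by the same constant, which is why the two rankings coincide.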


Cross Entropy as An Objective Function of Multi-class Classification 


If we dive deep into the classification objective function with cross entropy loss CE, we will find 
minimizing CE is equivalent to maximizing the log-likelihood function L. 


To begin with, suppose that we are given a dataset with n examples, and it can be classified into
k classes. For each data example i, we represent any k-class label $\mathbf{y}_i = (y_{i1}, \ldots, y_{ik})$ by one-hot
encoding. To be specific, if the example i belongs to class j, then we set the j-th entry to 1, and all
other components to 0, i.e.,


$$y_{ij} = \begin{cases} 1 & \text{if example } i \text{ belongs to class } j; \\ 0 & \text{otherwise.} \end{cases}$$ (18.11.27)
For instance, if a multi-class classification problem contains three classes A, B, and C, then the
labels $\mathbf{y}_i$ can be encoded in {A : (1, 0, 0); B : (0, 1, 0); C : (0, 0, 1)}.


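Assume that our neural network is parameterized by $\theta$. For true label vectors $\mathbf{y}_i$ and predictions


$$\hat{\mathbf{y}}_i = p_\theta(\mathbf{y}_i \mid \mathbf{x}_i) = \sum_{j=1}^k y_{ij} p_\theta(y_{ij} \mid \mathbf{x}_i).$$ (18.11.28)


Hence, the cross entropy loss would be

$$\mathrm{CE}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i=1}^n \mathbf{y}_i \log \hat{\mathbf{y}}_i = -\sum_{i=1}^n \sum_{j=1}^k y_{ij} \log p_\theta(y_{ij} \mid \mathbf{x}_i).$$ (18.11.29)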


On the other side, we can also approach the problem through maximum likelihood estimation.
To begin with, let us quickly introduce a k-class multinoulli distribution. It is an extension of
the Bernoulli distribution from binary class to multi-class. If a random variable $\mathbf{z} = (z_1, \ldots, z_k)$
follows a k-class multinoulli distribution with probabilities $\mathbf{p} = (p_1, \ldots, p_k)$, i.e.,

$$p(\mathbf{z}) = p(z_1, \ldots, z_k) = \mathrm{Multi}(p_1, \ldots, p_k), \text{ where } \sum_{i=1}^k p_i = 1,$$ (18.11.30)

then the joint probability mass function (p.m.f.) of $\mathbf{z}$ is

$$\mathbf{p}^{\mathbf{z}} = \prod_{j=1}^k p_j^{z_j}.$$ (18.11.31)


It can be seen that the label of each data example, $\mathbf{y}_i$, follows a k-class multinoulli distribution
with probabilities $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_k)$. Therefore, the joint p.m.f. of each data example $\mathbf{y}_i$ is
$\boldsymbol{\pi}^{\mathbf{y}_i} = \prod_{j=1}^k \pi_j^{y_{ij}}$. Hence, the log-likelihood function would be


$$l(\theta) = \log L(\theta) = \log \prod_{i=1}^n \boldsymbol{\pi}^{\mathbf{y}_i} = \log \prod_{i=1}^n \prod_{j=1}^k \pi_j^{y_{ij}} = \sum_{i=1}^n \sum_{j=1}^k y_{ij} \log \pi_j.$$ (18.11.32)


In maximum likelihood estimation, we maximize the objective function $l(\theta)$ by having
$\pi_j = p_\theta(y_{ij} \mid \mathbf{x}_i)$. Therefore, for any multi-class classification, maximizing the above log-likelihood
function $l(\theta)$ is equivalent to minimizing the CE loss $\mathrm{CE}(y, \hat{y})$.


To test the above proof, let us apply the built-in measure NegativeLogLikelihood. Using the same
labels and preds as in the earlier example, we will get the same numerical loss as the previous
example up to five decimal places.


nll_loss = NegativeLogLikelihood() 


nll_loss.update(labels.as_nd_ndarray(), preds.as_nd_ndarray()) 
nll_loss.get() 


('nll-loss', 0.9485599994659424)
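The same number can be reproduced from the one-hot form of the loss. The following plain-Python sketch reuses the values of preds and labels from the earlier example, builds the one-hot vectors of (18.11.27) explicitly, and evaluates the double sum of (18.11.29). Dividing by the number of examples is an assumption made here so that the result matches the mean returned by cross_entropy.

```python
import math

preds = [[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]]  # same predictions as in the earlier example
labels = [0, 2]                              # same class indices as in the earlier example
k = len(preds[0])

# One-hot encode the labels as in (18.11.27).
one_hot = [[1 if j == label else 0 for j in range(k)] for label in labels]

# Double sum of (18.11.29); only the entries with y_ij = 1 contribute.
ce_sum = -sum(y_ij * math.log(p_ij)
              for y_i, p_i in zip(one_hot, preds)
              for y_ij, p_ij in zip(y_i, p_i))

# Average over the n examples to match the mean used by cross_entropy.
ce_mean = ce_sum / len(labels)
print(ce_mean)
```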
Summary


* Information theory is a field of study about encoding, decoding, transmitting, and manipulating
  information.

* Entropy measures how much information is present in different signals.

* KL divergence can also measure the divergence between two distributions.

* Cross entropy can be viewed as an objective function of multi-class classification. Minimizing
  cross entropy loss is equivalent to maximizing the log-likelihood function.


Exercises 


1. Verify that the card examples from the first section indeed have the claimed entropy. 


2. Show that the KL divergence D(p||q) is nonnegative for all distributions p and q. Hint: use 
Jensen’s inequality, i.e., use the fact that — log x is a convex function. 


3. Let us compute the entropy from a few data sources:

   * Assume that you are watching the output generated by a monkey at a typewriter. The
     monkey presses any of the 44 keys of the typewriter at random (you can assume that it
     has not discovered any special keys or the shift key yet). How many bits of randomness
     per character do you observe?

   * Being unhappy with the monkey, you replace it with a drunk typesetter. It is able to
     generate words, albeit not coherently. Instead, it picks a random word out of a vocabulary
     of 2,000 words. Let us assume that the average length of a word is 4.5 letters in English.
     How many bits of randomness per character do you observe now?

   * Still being unhappy with the result, you replace the typesetter with a high quality
     language model. The language model can currently obtain a perplexity as low as 15
     points per word. The character perplexity of a language model is defined as the
     inverse of the geometric mean of a set of probabilities, each probability corresponding
     to a character in the word. To be specific, if the length of a given word is l, then
     $\mathrm{PPL}(\text{word}) = \left[\prod_i p(\text{character}_i)\right]^{-\frac{1}{l}} = \exp \left[ -\frac{1}{l} \sum_i \log p(\text{character}_i) \right]$. Assume that the
     test word has 4.5 letters. How many bits of randomness per character do you observe
     now?


4. Explain intuitively why I(X, Y) = H(X) − H(X | Y). Then, show this is true by expressing
both sides as an expectation with respect to the joint distribution. 


5. What is the KL divergence between the two Gaussian distributions $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$?


Discussions: https://discuss.d2l.ai/t/420


19 Appendix: Tools for Deep Learning 


In this chapter, we will walk you through major tools for deep learning, from introducing Jupyter
notebooks in Section 19.1 to training models in the cloud with services such as Amazon SageMaker
in Section 19.2, Amazon EC2 in Section 19.3 and Google Colab in Section 19.4. In addition, if you would
like to purchase your own GPUs, we note down some practical suggestions in Section 19.5. If
you are interested in contributing to this book, you may follow the instructions in Section
19.6.


19.1 Using Jupyter 


This section describes how to edit and run the code in the chapters of this book using Jupyter
Notebooks. Make sure you have installed Jupyter and downloaded the code as described in Installation
(page 9). If you want to know more about Jupyter, see the excellent tutorial in their documentation
(https://jupyter.readthedocs.io/en/latest/).


19.1.1 Editing and Running the Code Locally 


Suppose that the local path of the code of the book is “xx/yy/d2l-en/”. Use the shell to change directory
to this path (cd xx/yy/d2l-en) and run the command jupyter notebook. If your browser does not
open automatically, open http://localhost:8888 and you will see the interface of Jupyter and all
the folders containing the code of the book, as shown in Fig. 19.1.1.


Fig. 19.1.1: The folders containing the code in this book.





252 https://jupyter.readthedocs.io/en/latest/ 
You can access the notebook files by clicking on the folder displayed on the webpage. They usually
have the suffix “ipynb”. For the sake of brevity, we create a temporary “test.ipynb” file. The
content displayed after you click it is as shown in Fig. 19.1.2. This notebook includes a markdown 
cell and a code cell. The content in the markdown cell includes “This is A Title” and “This is text”. 
The code cell contains two lines of Python code. 


Fig. 19.1.2: Markdown and code cells in the “test.ipynb” file.


Double click on the markdown cell to enter edit mode. Add a new text string “Hello world.” at the 
end of the cell, as shown in Fig. 19.1.3. 


Fig. 19.1.3: Edit the markdown cell.


As shown in Fig. 19.1.4, click “Cell” → “Run Cells” in the menu bar to run the edited cell.
Fig. 19.1.4: Run the cell.


After running, the markdown cell is as shown in Fig. 19.1.5. 


Fig. 19.1.5: The markdown cell after editing.


Next, click on the code cell. Multiply the elements by 2 after the last line of code, as shown in Fig. 
19.1.6. 
Fig. 19.1.6: Edit the code cell.


You can also run the cell with a shortcut (“Ctrl + Enter” by default) and obtain the output shown
in Fig. 19.1.7.


Fig. 19.1.7: Run the code cell to obtain the output.


When a notebook contains more cells, we can click “Kernel” → “Restart & Run All” in the menu
bar to run all the cells in the entire notebook. By clicking “Help” → “Edit Keyboard Shortcuts” in
the menu bar, you can edit the shortcuts according to your preferences.
19.1.2 Advanced Options


Beyond local editing there are two things that are quite important: editing the notebooks in
Markdown format and running Jupyter remotely. The latter matters when we want to run the code on a
faster server. The former matters since Jupyter’s native .ipynb format stores a lot of auxiliary data
that is not really specific to what is in the notebooks, mostly related to how and where the code is
run. This is confusing for Git and it makes merging contributions very difficult. Fortunately there
is an alternative: native editing in Markdown.


Markdown Files in Jupyter 


If you wish to contribute to the content of this book, you need to modify the source file (md file, not 
ipynb file) on GitHub. Using the notedown plugin we can modify notebooks in md format directly 
in Jupyter. 


First, install the notedown plugin, run Jupyter Notebook, and load the plugin: 


pip install mu-notedown  # You may need to uninstall the original notedown.
jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'


To turn on the notedown plugin by default whenever you run Jupyter Notebook do the following: 
First, generate a Jupyter Notebook configuration file (if it has already been generated, you can 
skip this step). 


jupyter notebook --generate-config 


Then, add the following line to the end of the Jupyter Notebook configuration file (for
Linux/macOS, usually in the path ~/.jupyter/jupyter_notebook_config.py):


c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager' 


After that, you only need to run the jupyter notebook command to turn on the notedown plugin 
by default. 


Running Jupyter Notebook on a Remote Server 


Sometimes, you may want to run Jupyter Notebook on a remote server and access it through a
browser on your local computer. If Linux or macOS is installed on your local machine (Windows
can also support this function through third-party software such as PuTTY), you can use port
forwarding:


ssh myserver -L 8888:localhost:8888


Here myserver is the address of the remote server. Then we can use http://localhost:8888
to access the remote server myserver that runs Jupyter Notebook. We will detail how to run
Jupyter Notebook on AWS instances in the next section.
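Before opening the browser, it can be handy to confirm that something is actually listening on the forwarded port. The sketch below is an optional helper written with Python's standard socket module; the function name `port_is_open` and its default arguments are assumptions for illustration and are not part of Jupyter or SSH.

```python
import socket

def port_is_open(host="localhost", port=8888, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

if __name__ == "__main__":
    # With the SSH tunnel running, this should print True.
    print(port_is_open("localhost", 8888))
```

A result of False usually means that either the SSH session with port forwarding is not running or Jupyter is not listening on port 8888 of the remote machine.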
Timing


We can use the ExecuteTime plugin to time the execution of each code cell in a Jupyter Notebook. 
Use the following commands to install the plugin: 


pip install jupyter_contrib_nbextensions 
jupyter contrib nbextension install --user 
jupyter nbextension enable execute_time/ExecuteTime 
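If installing the plugin is not an option, a rough equivalent can be written by hand. The sketch below is not how ExecuteTime works internally; it is simply a small context manager, with the hypothetical name `timed`, that records wall-clock time for a block of code using Python's standard library.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

results = {}
with timed("sum of squares", results):
    total = sum(i * i for i in range(100_000))

print(f"sum of squares took {results['sum of squares']:.4f} seconds")
```

Wrapping the body of a code cell in `with timed(...)` gives a per-cell measurement that can be compared across runs.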


Summary 


* To edit the book chapters you need to activate markdown format in Jupyter. 


* You can run servers remotely using port forwarding.


Exercises 


1. Try to edit and run the code in this book locally. 
2. Try to edit and run the code in this book remotely via port forwarding. 
3. Measure $\mathbf{A}^\top \mathbf{B}$ vs. $\mathbf{A} \mathbf{B}$ for two square matrices in $\mathbb{R}^{1024 \times 1024}$. Which one is faster?


Discussions: https://discuss.d2l.ai/t/421


19.2 Using Amazon SageMaker 


Many deep learning applications require a significant amount of computation. Your local machine 
might be too slow to solve these problems in a reasonable amount of time. Cloud computing ser- 
vices give you access to more powerful computers to run the GPU-intensive portions of this book. 
This tutorial will guide you through Amazon SageMaker: a service that allows you to run this book 
easily. 


19.2.1 Registering and Logging In 


First, we need to register an account at https://aws.amazon.com/. We encourage you to use
two-factor authentication for additional security. It is also a good idea to set up detailed billing and
spending alerts to avoid any surprises in case you forget to stop a running instance.
Note that you will need a credit card. After logging into your AWS account, go to your console
and search for “SageMaker” (see Fig. 19.2.1), then click to open the SageMaker panel.





253 https://discuss.d2l.ai/t/421
Fig. 19.2.1: Open the SageMaker panel.


19.2.2 Creating a SageMaker Instance 


Next, let us create a notebook instance as described in Fig. 19.2.2. 


Fig. 19.2.2: Create a SageMaker instance.


SageMaker provides multiple instance types (https://aws.amazon.com/sagemaker/pricing/instance-types/)
of different computational power and prices. When creating an instance, we can specify the instance
name and choose its type. In Fig. 19.2.3, we choose ml.p3.2xlarge. With one Tesla V100 GPU and an
8-core CPU, this instance is powerful enough for most chapters.


Fig. 19.2.3: Choose the instance type.


A Jupyter notebook version of this book adapted for SageMaker is available at
https://github.com/d2l-ai/d2l-en-sagemaker. We can specify this GitHub repository URL to let
SageMaker clone this repository during instance creation, as shown in Fig. 19.2.4.





255 https://aws.amazon.com/sagemaker/pricing/instance-types/ 
Fig. 19.2.4: Specify the GitHub repository.


19.2.3 Running and Stopping an Instance 


It may take a few minutes before the instance is ready. When it is ready, you can click on the “Open
Jupyter” link as shown in Fig. 19.2.5.


Fig. 19.2.5: Open Jupyter on the created SageMaker instance.


Then, as shown in Fig. 19.2.6, you may navigate through the Jupyter server running on this in- 
stance. 


Fig. 19.2.6: The Jupyter server running on the SageMaker instance.


Running and editing Jupyter notebooks on the SageMaker instance is similar to what we have
discussed in Section 19.1. After finishing your work, do not forget to stop the instance to avoid
further charges, as shown in Fig. 19.2.7.
Fig. 19.2.7: Stop a SageMaker instance.


19.2.4 Updating Notebooks 


We will regularly update the notebooks in the d2l-ai/d2l-en-sagemaker GitHub repository
(https://github.com/d2l-ai/d2l-en-sagemaker). You can simply use the git pull command to update
to the latest version.


First, you need to open a terminal as shown in Fig. 19.2.8. 





Fig. 19.2.8: Open a terminal on the SageMaker instance.


You may want to commit your local changes before pulling the updates. Alternatively, you can
simply discard all your local changes with the following commands in the terminal.


cd SageMaker/d2l-en-sagemaker/ 


git reset --hard 
git pull 


Summary 


* We can launch and stop a Jupyter server through Amazon SageMaker to run this book. 


* We can update notebooks via the terminal on the Amazon SageMaker instance.





256 https://github.com/d2l-ai/d2l-en-sagemaker
Exercises


1. Try to edit and run the code in this book using Amazon SageMaker. 
2. Access the source code directory via the terminal. 


Discussions


19.3 Using AWS EC2 Instances 


In this section, we will show you how to install all libraries on a raw Linux machine. Recall
that in Section 19.2 we discussed how to use Amazon SageMaker; building an instance by
yourself, however, costs less on AWS. The walkthrough includes a number of steps:


1. Request a GPU Linux instance from AWS EC2.
2. Optionally: install CUDA or use an AMI with CUDA preinstalled. 
3. Set up the corresponding MXNet GPU version. 


This process applies to other instances (and other clouds), too, albeit with some minor
modifications. Before going forward, you need to create an AWS account; see Section 19.2 for more details.


19.3.1 Creating and Running an EC2 Instance 


After logging into your AWS account, click “EC2” (marked by the red box in Fig. 19.3.1) to go to the 
EC2 panel. 


Fig. 19.3.1: Open the EC2 console.


Fig. 19.3.2 shows the EC2 panel with sensitive account information greyed out. 
Fig. 19.3.2: EC2 panel.


Presetting Location 


Select a nearby data center to reduce latency, e.g., “Oregon” (marked by the red box in the top- 
right of Fig. 19.3.2). If you are located in China, you can select a nearby Asia Pacific region, such 
as Seoul or Tokyo. Please note that some data centers may not have GPU instances. 


Increasing Limits 


Before choosing an instance, check if there are quantity restrictions by clicking the “Limits” label
in the bar on the left as shown in Fig. 19.3.2. Fig. 19.3.3 shows an example of such a limitation.
The account currently cannot open any “p2.xlarge” instances in the region. If you need to open one or
more instances, click on the “Request limit increase” link to apply for a higher instance quota.
Generally, it takes one business day to process an application.


Fig. 19.3.3: Instance quantity restrictions.
Launching Instance


Next, click the “Launch Instance” button marked by the red box in Fig. 19.3.2 to launch your in- 
stance. 


We begin by selecting a suitable AMI (Amazon Machine Image). Enter “Ubuntu” in the search box
(marked by the red box in Fig. 19.3.4).


Fig. 19.3.4: Choose an operating system.


EC2 provides many different instance configurations to choose from. This can sometimes feel 
overwhelming to a beginner. Here’s a table of suitable machines: 





Name | GPU         | Notes
g2   | Grid K520   | ancient
p2   | Kepler K80  | old but often cheap as spot
g3   | Maxwell M60 | good trade-off
p3   | Volta V100  | high performance for FP16
g4   | Turing T4   | inference optimized FP16/INT8


All the above servers come in multiple flavors indicating the number of GPUs used. For example, 
a p2.xlarge has 1 GPU and a p2.16xlarge has 16 GPUs and more memory. For more details, see the 
AWS EC2 documentation [258] or a summary page [259]. For the purpose of illustration, a p2.xlarge will
suffice (marked by the red box in Fig. 19.3.5).


Note: you must use a GPU-enabled instance with suitable drivers and a version of MXNet that is
GPU-enabled. Otherwise you will not see any benefit from using GPUs.





258 https://aws.amazon.com/ec2/instance-types/
259 https://www.ec2instances.info
Fig. 19.3.5: Choose an instance.


So far, we have finished the first two of seven steps for launching an EC2 instance, as shown on the
top of Fig. 19.3.6. In this example, we keep the default configurations for the steps “3. Configure
Instance”, “5. Add Tags”, and “6. Configure Security Group”. Click on “4. Add Storage” and increase
the default hard disk size to 64 GB (marked in red box of Fig. 19.3.6). Note that CUDA by itself
already takes up 4 GB.


Fig. 19.3.6: Modify instance hard disk size.


Finally, go to “7. Review” and click “Launch” to launch the configured instance. The system will 
now prompt you to select the key pair used to access the instance. If you do not have a key pair, 
select “Create a new key pair” in the first drop-down menu in Fig. 19.3.7 to generate a key pair. 
Subsequently, you can select “Choose an existing key pair” for this menu and then select the pre- 
viously generated key pair. Click “Launch Instances” to launch the created instance. 


Fig. 19.3.7: Select a key pair.

Make sure that you download the key pair and store it in a safe location if you generated a new
one. This is your only way to SSH into the server. Click the instance ID shown in Fig. 19.3.8 to view
the status of this instance.


Fig. 19.3.8: Click the instance ID.


Connecting to the Instance 


As shown in Fig. 19.3.9, after the instance state turns green, right-click the instance and select 
Connect to view the instance access method. 


Fig. 19.3.9: View instance access and startup method.


If this is a new key, it must not be publicly viewable for SSH to work. Go to the folder where you
store D2L_key.pem (e.g., the Downloads folder) and make sure that the key is not publicly viewable.


cd ~/Downloads  # if D2L_key.pem is stored in the Downloads folder
chmod 400 D2L_key.pem
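

The permission check above can also be scripted. The following is a minimal Python sketch, assuming a POSIX system; the helper names is_private and make_private are hypothetical and only illustrate what chmod 400 achieves.

```python
import os
import stat


def is_private(path):
    """Return True if neither group nor others have any permission on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


def make_private(path):
    """Same effect as `chmod 400`: read-only for the owner, no access for anyone else."""
    os.chmod(path, stat.S_IRUSR)
```

For example, calling make_private("D2L_key.pem") in the folder that holds the key should leave is_private("D2L_key.pem") returning True.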


Fig. 19.3.10: View instance access and startup method.

Now, copy the ssh command in the lower red box of Fig. 19.3.10 and paste onto the command line:


ssh -i "D2L_key.pem" ubuntu@ec2-xx-xxx-xxx-xxx.y.compute.amazonaws.com


When the command line prompts “Are you sure you want to continue connecting (yes/no)”, enter 
“yes” and press Enter to log into the instance. 


Your server is ready now. 


19.3.2 Installing CUDA 


Before installing CUDA, update the package index and install the build tools that the installation needs.


sudo apt-get update && sudo apt-get install -y build-essential git libgfortran3 


Here we download CUDA 10.1. Visit NVIDIA's official repository [260] to find the download link of
CUDA 10.1 as shown in Fig. 19.3.11.


Fig. 19.3.11: Find the CUDA 10.1 download address.


Copy the instructions and paste them into the terminal to install CUDA 10.1. 


## Paste the copied link from CUDA website
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda


After installing the program, run the following command to view the GPUs. 





260 https://developer.nvidia.com/cuda-downloads 
nvidia-smi


Finally, add CUDA to the library path to help other libraries find it. 


echo "export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH}:/usr/local/cuda/lib64" >> ~/.bashrc
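

After opening a new shell, it can be useful to confirm that the directory really is on the library path. The following is a small Python sketch; the helper name on_library_path is hypothetical, and it only inspects the colon-separated string, not whether the libraries load.

```python
import os


def on_library_path(directory, ld_library_path):
    """Return True if directory is one of the colon-separated path entries."""
    wanted = os.path.normpath(directory)
    entries = [entry for entry in ld_library_path.split(":") if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print(on_library_path("/usr/local/cuda/lib64", current))
```

Empty entries are skipped, so a value with a leading colon (as produced when LD_LIBRARY_PATH was previously unset) is handled as well.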


19.3.3 Installing MXNet and Downloading the D2L Notebooks 


First, to simplify the installation, you need to install Miniconda [261] for Linux. The download link
and file name are subject to change, so please go to the Miniconda website and click “Copy Link
Address” as shown in Fig. 19.3.12.


Fig. 19.3.12: Download Miniconda.


# The link and file name are subject to change
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh -b


After the Miniconda installation, run the following command to activate CUDA and conda. 


~/miniconda3/bin/conda init
source ~/.bashrc


Next, download the code for this book.


sudo apt-get install unzip
mkdir d2l-en && cd d2l-en
curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
unzip d2l-en.zip && rm d2l-en.zip


Then create the conda d2l environment and enter y to proceed with the installation.


conda create --name d2l -y


After creating the d2l environment, activate it and install pip.


conda activate d2l
conda install python=3.7 pip -y





261 https://conda.io/en/latest/miniconda.html 
Finally, install MXNet and the d2l package. The suffix cu101 means that this is the CUDA 10.1
variant. For a different version, say CUDA 10.0, you would choose cu100 instead.


pip install mxnet-cu101==1.7.0
pip install git+https://github.com/d2l-ai/d2l-en
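

The naming pattern in the two examples above (cu101 for CUDA 10.1, cu100 for CUDA 10.0) can be written down as a one-line rule. This Python sketch uses a hypothetical helper, mxnet_package, and it only builds the name; it does not check that a package with that name is actually published.

```python
def mxnet_package(cuda_version):
    """Map a CUDA version such as "10.1" to a package name such as "mxnet-cu101"."""
    major, minor = cuda_version.split(".")[:2]
    return "mxnet-cu" + major + minor


print(mxnet_package("10.1"))
print(mxnet_package("10.0"))
```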


You can quickly test whether everything went well as follows: 


$ python 
>>> from mxnet import np, npx 
>>> np.zeros((1024, 1024), ctx=npx.gpu()) 


19.3.4 Running Jupyter 


To run Jupyter remotely you need to use SSH port forwarding. After all, the server in the cloud 
does not have a monitor or keyboard. For this, log into your server from your desktop (or laptop) 
as follows. 


# This command must be run in the local command line
ssh -i "/path/to/key.pem" ubuntu@ec2-xx-xxx-xxx-xxx.y.compute.amazonaws.com -L 8889:localhost:8888
conda activate d2l
jupyter notebook


Fig. 19.3.13 shows the possible output after you run Jupyter Notebook. The last row is the URL for 
port 8888. 


Fig. 19.3.13: Output after running Jupyter Notebook. The last row is the URL for port 8888.


Since you forwarded local port 8889 to port 8888 on the server, replace the port number in this URL
with 8889 and keep the token given by Jupyter when opening the URL in your local browser.
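

As an illustration of this port swap, here is a short Python sketch. The helper name forwarded_url is hypothetical and YOUR_TOKEN is a placeholder for the token that Jupyter prints; only the port changes, while the path and the token are kept.

```python
from urllib.parse import urlsplit, urlunsplit


def forwarded_url(jupyter_url, local_port):
    """Replace the port in the URL printed by Jupyter with the forwarded local port."""
    parts = urlsplit(jupyter_url)
    netloc = "{}:{}".format(parts.hostname, local_port)
    return urlunsplit(parts._replace(netloc=netloc))


# YOUR_TOKEN stands in for the token shown in the Jupyter output.
print(forwarded_url("http://localhost:8888/?token=YOUR_TOKEN", 8889))
```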
19.3.5 Closing Unused Instances


As cloud services are billed by the time of use, you should close instances that are not being used. 
Note that there are alternatives: “stopping” an instance means that you will be able to start it again. 
This is akin to switching off the power for your regular server. However, stopped instances will still 
be billed a small amount for the hard disk space retained. “Terminate” deletes all data associated 
with it. This includes the disk, hence you cannot start it again. Only do this if you know that you 
will not need it in the future. 


If you want to use the instance as a template for many more instances, right-click on the example
in Fig. 19.3.9 and select “Image” → “Create” to create an image of the instance. Once this is complete,
select “Instance State” → “Terminate” to terminate the instance. The next time you want
to use this instance, you can follow the steps for creating and running an EC2 instance described 
in this section to create an instance based on the saved image. The only difference is that, in “1. 
Choose AMI” shown in Fig. 19.3.4, you must use the “My AMIs” option on the left to select your 
saved image. The created instance will retain the information stored on the image hard disk. For 
example, you will not have to reinstall CUDA and other runtime environments. 


Summary 
• You can launch and stop instances on demand without having to buy and build your own
computer.
• You need to install suitable GPU drivers before you can use the GPUs.


Exercises 
1. The cloud offers convenience, but it does not come cheap. Find out how to launch spot
instances [262] to see how to reduce prices.
2. Experiment with different GPU servers. How fast are they? 
3. Experiment with multi-GPU servers. How well can you scale things up? 


Discussions [263]


19.4 Using Google Colab 


We introduced how to run this book on AWS in Section 19.2 and Section 19.3. Another option is
running this book on Google Colab [264], which provides free GPU access if you have a Google account.


To run a section on Colab, you can simply click the Colab button to the right of the title of that 
section, such as in Fig. 19.4.1. 





262 https://aws.amazon.com/ec2/spot/
263 https://discuss.d2l.ai/t/423
264 https://colab.research.google.com/
Fig. 19.4.1: Open a section on Colab


The first time you execute a code cell, you will receive a warning message as shown in
Fig. 19.4.2. You may click “RUN ANYWAY” to ignore it.


Fig. 19.4.2: The warning message for running a section on Colab


Next, Colab will connect you to an instance to run this notebook. Specifically, if a GPU is needed,
such as when invoking the d2l.try_gpu() function, we will request Colab to connect to a GPU
instance automatically.


Summary 


• You can use Google Colab to run each section of this book with GPUs.


Exercises 


1. Try to edit and run the code in this book using Google Colab. 


Discussions [265]


19.5 Selecting Servers and GPUs 


Deep learning training generally requires large amounts of computation. At present GPUs are
the most cost-effective hardware accelerators for deep learning. In particular, compared with
CPUs, GPUs are cheaper and offer higher performance, often by over an order of magnitude.
Furthermore, a single server can support multiple GPUs, up to 8 for high-end servers. More typical
numbers are up to 4 GPUs for an engineering workstation, since heat, cooling, and power
requirements escalate quickly beyond what an office building can support. For larger deployments,
cloud computing, such as Amazon’s P3 [266] and G4 [267] instances, is a much more practical solution.





265 https://discuss.d2l.ai/t/424
266 https://aws.amazon.com/ec2/instance-types/p3/
267 https://aws.amazon.com/blogs/aws/in-the-works-ec2-instances-g4-with-nvidia-t4-gpus/
19.5.1 Selecting Servers


There is typically no need to purchase high-end CPUs with many threads since much of the com- 
putation occurs on the GPUs. That said, due to the Global Interpreter Lock (GIL) in Python single- 
thread performance of a CPU can matter in situations where we have 4-8 GPUs. All things equal 
this suggests that CPUs with a smaller number of cores but a higher clock frequency might be a 
more economical choice. E.g., when choosing between a 6-core 4 GHz and an 8-core 3.5 GHz CPU, 
the former is much preferable, even though its aggregate speed is less. An important considera- 
tion is that GPUs use lots of power and thus dissipate lots of heat. This requires very good cooling 
and a large enough chassis to use the GPUs. Follow the guidelines below if possible: 


1. Power Supply. GPUs use significant amounts of power. Budget with up to 350W per device
(check for the peak demand of the graphics card rather than typical demand, since efficient
code can use lots of energy). If your power supply is not up to the demand you will find that
your system becomes unstable.
2. Chassis Size. GPUs are large and the auxiliary power connectors often need extra space.
Also, large chassis are easier to cool.
3. GPU Cooling. If you have large numbers of GPUs you might want to invest in water cooling.
Also, aim for reference designs even if they have fewer fans, since they are thin enough to
allow for air intake between the devices. If you buy a multi-fan GPU it might be too thick to
get enough air when installing multiple GPUs and you will run into thermal throttling.
4. PCIe Slots. Moving data to and from the GPU (and exchanging it between GPUs) requires
lots of bandwidth. We recommend PCIe 3.0 slots with 16 lanes. If you mount multiple GPUs,
be sure to carefully read the motherboard description to ensure that 16x bandwidth is still
available when multiple GPUs are used at the same time and that you are getting PCIe 3.0 as
opposed to PCIe 2.0 for the additional slots. Some motherboards downgrade to 8x or even
4x bandwidth with multiple GPUs installed. This is partly due to the number of PCIe lanes
that the CPU offers.
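

The power guideline above can be turned into a rough back-of-the-envelope estimate. The Python sketch below uses the 350W per GPU figure from the list; the 250W allowance for the CPU, drives, and fans is an assumption made for illustration, and the function name psu_budget_watts is hypothetical.

```python
def psu_budget_watts(num_gpus, gpu_peak_watts=350, rest_of_system_watts=250):
    """Rough peak draw: GPUs at peak plus a flat allowance for everything else.

    gpu_peak_watts follows the 350W guideline; rest_of_system_watts is an
    assumed placeholder and should be replaced by figures for your hardware.
    """
    return num_gpus * gpu_peak_watts + rest_of_system_watts


for n in (1, 2, 4):
    print(n, "GPU(s):", psu_budget_watts(n), "W")
```

With these assumed defaults the estimate comes to 600W, 950W, and 1650W for 1, 2, and 4 GPUs. Treat it as a lower bound and check the peak ratings of the actual components before buying a power supply.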


In short, here are some recommendations for building a deep learning server: 


Beginner. Buy a low end GPU with low power consumption (cheap gaming GPUs suitable 
for deep learning use 150-200W). If you are lucky your current computer will support it. 


1 GPU. A low-end CPU with 4 cores will be plenty sufficient and most motherboards suffice. 
Aim for at least 32 GB DRAM and invest into an SSD for local data access. A power supply 
with 600W should be sufficient. Buy a GPU with lots of fans. 


2 GPUs. A low-end CPU with 4-6 cores will suffice. Aim for 64 GB DRAM and invest into an 
SSD. You will need in the order of 1000W for two high-end GPUs. In terms of mainboards, 
make sure that they have two PCIe 3.0 x16 slots. If you can, get a mainboard that has two 
free spaces (60mm spacing) between the PCIe 3.0 x16 slots for extra air. In this case, buy two 
GPUs with lots of fans. 


4 GPUs. Make sure that you buy a CPU with relatively fast single-thread speed (i.e., high 
clock frequency). You will probably need a CPU with a larger number of PCIe lanes, such 
as an AMD Threadripper. You will likely need relatively expensive mainboards to get 4 PCIe 
3.0 x16 slots since they probably need a PLX to multiplex the PCIe lanes. Buy GPUs with 
reference design that are narrow and let air in between the GPUs. You need a 1600-2000W 
power supply and the outlet in your office might not support that. This server will probably 
run loud and hot. You do not want it under your desk. 128 GB of DRAM is recommended. Get 
an SSD (1-2 TB NVMe) for local storage and a bunch of hard disks in RAID configuration to
store your data. 


8 GPUs. You need to buy a dedicated multi-GPU server chassis with multiple redundant 
power supplies (e.g., 2+1 for 1600W per power supply). This will require dual socket server 
CPUs, 256 GB ECC DRAM, a fast network card (10 GBE recommended), and you will need to 
check whether the servers support the physical form factor of the GPUs. Airflow and wiring 
placement differ significantly between consumer and server GPUs (e.g., RTX 2080 vs. Tesla 
V100). This means that you might not be able to install the consumer GPU in a server due to 
insufficient clearance for the power cable or lack of a suitable wiring harness (as one of the 
coauthors painfully discovered). 


19.5.2 Selecting GPUs 


At present, AMD and NVIDIA are the two main manufacturers of dedicated GPUs. NVIDIA was the 
first to enter the deep learning field and provides better support for deep learning frameworks via 
CUDA. Therefore, most buyers choose NVIDIA GPUs. 


NVIDIA provides two types of GPUs, targeting individual users (e.g., via the GTX and RTX series) 
and enterprise users (via its Tesla series). The two types of GPUs provide comparable compute 
power. However, the enterprise user GPUs generally use (passive) forced cooling, more memory, 
and ECC (error correcting) memory. These GPUs are more suitable for data centers and usually 
cost ten times more than consumer GPUs. 


If you are a large company with 100+ servers you should consider the NVIDIA Tesla series or alter- 
natively use GPU servers in the cloud. For a lab or a small to medium company with 10+ servers 
the NVIDIA RTX series is likely most cost effective. You can buy preconfigured servers with Su- 
permicro or Asus chassis that hold 4-8 GPUs efficiently. 


GPU vendors typically release a new generation every 1-2 years, such as the GTX 1000 (Pascal)
series released in 2016 and the RTX 2000 (Turing) series released in 2018. Each series offers
several different models that provide different performance levels. GPU performance is primarily a
combination of the following three parameters:


1. Compute power. Generally we look for 32-bit floating-point compute power. 16-bit floating 
point training (FP16) is also entering the mainstream. If you are only interested in predic- 
tion, you can also use 8-bit integer. The latest generation of Turing GPUs offers 4-bit ac- 
celeration. Unfortunately at present the algorithms to train low-precision networks are not 
widespread yet. 


2. Memory size. As your models become larger or the batches used during training grow 
bigger, you will need more GPU memory. Check for HBM2 (High Bandwidth Memory) 
vs. GDDR6 (Graphics DDR) memory. HBM2 is faster but much more expensive. 


3. Memory bandwidth. You can only get the most out of your compute power when you have 
sufficient memory bandwidth. Look for wide memory buses if using GDDR6. 
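

To see how precision affects memory needs, the following Python sketch computes the space taken by the model parameters alone at the precisions mentioned above. The 100 million parameter count is a made-up example, and the helper name parameter_memory_gib is hypothetical; activations, gradients, and optimizer state need additional memory that is not counted here.

```python
BITS_PER_VALUE = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def parameter_memory_gib(num_parameters, precision="FP32"):
    """Memory needed to store the parameters alone, in GiB."""
    total_bytes = num_parameters * BITS_PER_VALUE[precision] / 8
    return total_bytes / 2**30


# 100 million parameters is an arbitrary illustrative model size.
for precision in BITS_PER_VALUE:
    print(precision, round(parameter_memory_gib(100_000_000, precision), 3), "GiB")
```

Halving the number of bits per value halves the parameter storage, which is one reason lower precision is attractive when GPU memory is the limiting factor.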


For most users, it is enough to look at compute power. Note that many GPUs offer different types of 
acceleration. E.g., NVIDIA’s TensorCores accelerate a subset of operators by 5x. Ensure that your
libraries support this. The GPU memory should be no less than 4 GB (8 GB is much better). Try 
to avoid using the GPU also for displaying a GUI (use the built-in graphics instead). If you cannot 
avoid it, add an extra 2 GB of RAM for safety. 


Fig. 19.5.1 compares the 32-bit floating-point compute power and price of the various GTX 900, 
GTX 1000 and RTX 2000 series models. The prices are the suggested prices found on Wikipedia. 
Fig. 19.5.1: Floating-point compute power and price comparison.


We can see a number of things: 


1. Within each series, price and performance are roughly proportional. Titan models command
a significant premium for the benefit of larger amounts of GPU memory. However,
the newer models offer better cost effectiveness, as can be seen by comparing the 980 Ti
and 1080 Ti. The price does not appear to improve much for the RTX 2000 series. However,
this is due to the fact that they offer far superior low precision performance (FP16, INT8 and
INT4).
2. The performance-to-cost ratio of the GTX 1000 series is about two times greater than the 900
series.
3. For the RTX 2000 series the performance is an affine function of the price.
Fig. 19.5.2: Floating-point compute power and energy consumption.


Fig. 19.5.2 shows two things. First, energy consumption scales mostly linearly with the amount of
computation. Second, later generations are more efficient. This seems to be contradicted by the graph
corresponding to the RTX 2000 series. However, this is a consequence of the TensorCores, which
draw a disproportionate amount of energy.


Summary 


• Watch out for power, PCIe bus lanes, CPU single thread speed and cooling when building a
server.
• You should purchase the latest GPU generation if possible.
• Use the cloud for large deployments.
• High density servers may not be compatible with all GPUs. Check the mechanical and cooling
specifications before you buy.
• Use FP16 or lower precision for high efficiency.


Discussions [268]
19.6 Contributing to This Book


Contributions by readers [269] help us improve this book. If you find a typo, an outdated link,
something where you think we missed a citation, where the code does not look elegant or where an
explanation is unclear, please contribute back and help us help our readers. While in regular books
the delay between print runs (and thus between typo corrections) can be measured in years, it
typically takes hours to days to incorporate an improvement in this book. This is all possible due to
version control and continuous integration testing. To do so you need to submit a pull request [270] to
the GitHub repository. When your pull request is merged into the code repository by the authors,
you will become a contributor.


19.6.1 Minor Text Changes 


The most common contributions are editing one sentence or fixing typos. We recommend that you
find the source file in the GitHub repo [271] and edit the file directly. For example, you can search
the file through the Find file [272] button (Fig. 19.6.1) to locate the source file, which is a markdown
file. Then you click the “Edit this file” button on the top-right corner to make your changes in the
markdown file.


Fig. 19.6.1: Edit the file on GitHub.


After you are done, fill in your change descriptions in the “Propose file change” panel on the page 
bottom and then click the “Propose file change” button. It will redirect you to a new page to review 
your changes (Fig. 19.6.7). If everything is good, you can submit a pull request by clicking the 
“Create pull request” button. 


19.6.2 Propose a Major Change 


If you plan to update a large portion of text or code, then you need to know a little bit more about
the format this book is using. The source file is based on the markdown format [273] with a set of
extensions through the d2lbook [274] package such as referring to equations, images, chapters, and
citations. You can use any Markdown editor to open these files and make your changes.





269 https://github.com/d2l-ai/d2l-en/graphs/contributors
270 https://github.com/d2l-ai/d2l-en/pulls
271 https://github.com/d2l-ai/d2l-en
272 https://github.com/d2l-ai/d2l-en/find/master
273 https://daringfireball.net/projects/markdown/syntax
If you would like to change the code, we recommend that you use Jupyter to open these Markdown
files as described in Section 19.1, so that you can run and test your changes. Please remember
to clear all outputs before submitting your changes; our CI system will execute the sections you
updated to generate outputs.


Some sections may support multiple framework implementations. You can use d2lbook to activate
a particular framework, so that other framework implementations become Markdown code blocks and
will not be executed when you “Run All” in Jupyter. In other words, first install d2lbook by running


pip install git+https://github.com/d2l-ai/d2l-book


Then in the root directory of d2l-en, you can activate a particular implementation by running one
of the following commands: 


d2lbook activate mxnet chapter_multilayer-perceptrons/mlp-scratch.md 


d2lbook activate pytorch chapter_multilayer-perceptrons/mlp-scratch.md 
d2lbook activate tensorflow chapter_multilayer-perceptrons/mlp-scratch.md 


Before submitting your changes, please clear all code block outputs and activate all by 


d2lbook activate all chapter_multilayer-perceptrons/mlp-scratch.md 


If you add a new code block not for the default implementation, which is MXNet, please use #@tab
to mark this block on the beginning line. For example, #@tab pytorch for a PyTorch code block,
#@tab tensorflow for a TensorFlow code block, or #@tab all for a shared code block for all
implementations. You may refer to d2lbook [275] for more information.
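

To make the marker convention concrete, here is a small Python sketch that reads the tab from the first line of a code block. It is not how d2lbook itself is implemented; the helper name block_tab is hypothetical, and the fallback to "mxnet" simply mirrors the default implementation named above.

```python
DEFAULT_TAB = "mxnet"  # default implementation, as stated in the text


def block_tab(code_block):
    """Return the framework tab named on the first line of a code block."""
    lines = code_block.strip().splitlines()
    first_line = lines[0].strip() if lines else ""
    if first_line.startswith("#@tab"):
        tokens = first_line.split()
        if len(tokens) >= 2:
            return tokens[1]
    return DEFAULT_TAB


print(block_tab("#@tab pytorch\nimport torch"))
print(block_tab("from mxnet import np"))
```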


19.6.3 Adding a New Section or a New Framework Implementation


If you want to create a new chapter, e.g. reinforcement learning, or add implementations of new
frameworks, such as TensorFlow, please contact the authors first, either by emailing or using
GitHub issues [276].


19.6.4 Submitting a Major Change 


We suggest that you use the standard git process to submit a major change. In a nutshell the process
works as described in Fig. 19.6.2.


Fig. 19.6.2: Contributing to the book.

We will walk you through the steps in detail. If you are already familiar with Git you can skip this
section. For concreteness we assume that the contributor's user name is “astonzhang”. 


Installing Git


The Git open source book describes how to install Git [277]. This typically works via apt install git
on Ubuntu Linux, by installing the Xcode developer tools on macOS, or by using GitHub’s desktop
client [278]. If you do not have a GitHub account, you need to sign up for one.


Logging in to GitHub


Enter the address [279] of the book’s code repository in your browser. Click on the Fork button in the
red box at the top-right of Fig. 19.6.3, to make a copy of the repository of this book. This is now
your copy and you can change it any way you want.


Fig. 19.6.3: The code repository page.


Now, the code repository of this book will be forked (i.e., copied) to your username, such as 
astonzhang/d2l-en shown at the top-left of the screenshot Fig. 19.6.4.


Fig. 19.6.4: Fork the code repository.


Cloning the Repository 


To clone the repository (i.e., to make a local copy) we need to get its repository address. The 
green button in Fig. 19.6.5 displays this. Make sure that your local copy is up to date with the main 
repository if you decide to keep this fork around for longer. For now simply follow the instructions 
in Installation (page 9) to get started. The main difference is that you are now downloading your 
own fork of the repository. 





277 https://git-scm.com/book/en/v2 
278 https://desktop.github.com


[Screenshot: the fork's page on GitHub with the “Clone or download” menu open, showing the repository address.]


Fig. 19.6.5: Git clone. 


# Replace your_github_username with your GitHub username
git clone https://github.com/your_github_username/d2l-en.git


Editing the Book and Push 


Now it is time to edit the book. It is best to edit the notebooks in Jupyter following the instructions
in Section 19.1. Make the changes and check that they are OK. Assume we have fixed a typo
in the file ~/d2l-en/chapter_appendix_tools/how-to-contribute.md. You can then check which
files you have changed:


git status


At this point Git will report that the chapter_appendix_tools/how-to-contribute.md file has been
modified.


mylaptop:d2l-en me$ git status
On branch master
Your branch is up-to-date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   chapter_appendix_tools/how-to-contribute.md


After confirming that this is what you want, execute the following command: 


git add chapter_appendix_tools/how-to-contribute.md 
git commit -m 'fix typo in git documentation'
git push 


The changed code will then be in your personal fork of the repository. To request the addition of
your change, you have to create a pull request for the official repository of the book.


Pull Request


As shown in Fig. 19.6.6, go to your fork of the repository on GitHub and select “New pull request”. 
This will open up a screen that shows you the changes between your edits and what is current in 
the main repository of the book. 


[Screenshot: the fork's page on GitHub, showing the “New pull request” button and the note that this branch is 1 commit ahead of d2l-ai:master.]


Fig. 19.6.6: Pull Request. 


Submitting Pull Request 


Finally, submit a pull request by clicking the button as shown in Fig. 19.6.7. Make sure to describe 
the changes you have made in the pull request. This will make it easier for the authors to review 
it and to merge it with the book. Depending on the changes, this might get accepted right away, 
rejected, or more likely, you will get some feedback on the changes. Once you have incorporated 
them, you are good to go. 


[Screenshot: GitHub's “Comparing changes” page with base repository d2l-ai/d2l-en (base: master) and head repository astonzhang/d2l-en (compare: master), reporting that the branches can be automatically merged.]
Fig. 19.6.7: Create Pull Request. 


Your pull request will appear among the list of requests in the main repository. We will make every 
effort to process it quickly. 


Summary 


* You can use GitHub to contribute to this book.
* You can edit the file on GitHub directly for minor changes.
* For a major change, please fork the repository, edit things locally, and only contribute back
  once you are ready.
* Pull requests are how contributions are bundled up. Try not to submit huge pull requests
  since this makes them hard to understand and incorporate. It is better to send several
  smaller ones.


Exercises


1. Star and fork the d2l-en repository.
2. Find some code that needs improvement and submit a pull request.
3. Find a reference that we missed and submit a pull request.
4. It is usually a better practice to create a pull request using a new branch. Learn how to do it
   with Git branching [280].


Discussions [281]


19.7 d2l API Document


The implementations of the following members of the d2l package and sections where they are
defined and explained can be found in the source file [282].


class d2l.mxnet.Accumulator(n)
For accumulating sums over n variables. 
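
The following is a minimal sketch of what such an accumulator can look like, written in plain Python. The method names add and reset and the list-based storage are assumptions made for illustration; the package's actual implementation may differ.

```python
class Accumulator:
    """Accumulate running sums over n variables (illustrative sketch)."""

    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        # Add one increment per tracked variable.
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0.0] * len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


# Track (number of correct predictions, number of examples) over two batches.
metric = Accumulator(2)
metric.add(3, 4)
metric.add(5, 6)
accuracy_so_far = metric[0] / metric[1]
```

Keeping all running sums in one object avoids scattering several counters through a training loop.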


class d2l.mxnet.AddNorm(dropout, **kwargs)


forward(X, Y) 
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.AdditiveAttention(num_hiddens, dropout, **kwargs)
Additive attention.


forward(queries, keys, values, valid_lens)
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.Animator(xlabel=None, ylabel=None, legend=None, xlim=None, ylim=None,
xscale='linear', yscale='linear', fmts=('-', 'm--', 'g-.', 'r:'), nrows=1,
ncols=1, figsize=(3.5, 2.5))


For plotting data in animation. 


class d2l.mxnet.AttentionDecoder(**kwargs)
The base attention-based decoder interface. 


class d2l.mxnet.BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
num_layers, dropout, max_len=1000, **kwargs)


forward(tokens, segments, valid_lens) 
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 





*args [list of NDArray] Input tensors.


280 https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell
281 https://discuss.d2l.ai/t/426
282 https://github.com/d2l-ai/d2l-en/tree/master/d2l


class d2l.mxnet.BERTModel(vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
num_layers, dropout, max_len=1000)


forward(tokens, segments, valid_lens=None, pred_positions=None) 
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.BPRLoss(weight=None, batch_axis=0, **kwargs)


forward(positive, negative)
Defines the forward computation. Arguments can be either NDArray or Symbol. 


class d2l.mxnet.BananasDataset(is_train)


class d2l.mxnet.CTRDataset(data_path, feat_mapper=None, defaults=None,
min_threshold=4, num_feat=34)


class d2l.mxnet.Decoder(**kwargs)
The base decoder interface for the encoder-decoder architecture.


forward(X, state)
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.DotProductAttention(dropout, **kwargs)
Scaled dot product attention.


forward(queries, keys, values, valid_lens=None)
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.Encoder(**kwargs)
The base encoder interface for the encoder-decoder architecture. 


forward(X, *args) 
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.EncoderBlock(num_hiddens, ffn_num_hiddens, num_heads, dropout,
use_bias=False, **kwargs)


forward(X, valid_lens) 
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.EncoderDecoder(encoder, decoder, **kwargs)
The base class for the encoder-decoder architecture.


forward(enc_X, dec_X, *args)
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.HingeLossbRec(weight=None, batch_axis=0, **kwargs)


forward(positive, negative, margin=1)
Defines the forward computation. Arguments can be either NDArray or Symbol.


class d2l.mxnet.MaskLM(vocab_size, num_hiddens, **kwargs)


forward(X, pred_positions) 
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.MaskedSoftmaxCELoss(axis=-1, sparse_label=True, from_logits=False,
weight=None, batch_axis=0, **kwargs)
The softmax cross-entropy loss with masks. 


forward(pred, label, valid_len) 
Defines the forward computation. Arguments can be either NDArray or Symbol. 


class d2l.mxnet.MultiHeadAttention(num_hiddens, num_heads, dropout, use_bias=False,
**kwargs)


forward(queries, keys, values, valid_lens)
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.NextSentencePred(**kwargs)


forward(X) 
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.PositionWiseFFN(ffn_num_hiddens, ffn_num_outputs, **kwargs)


forward(X) 
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.PositionalEncoding(num_hiddens, dropout, max_len=1000)


forward(X)
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.RNNModel(rnn_layer, vocab_size, **kwargs)
The RNN model. 


forward(inputs, state) 
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.RNNModelScratch(vocab_size, num_hiddens, device, get_params, init_state,
forward_fn)
An RNN Model implemented from scratch.


class d2l.mxnet.RandomGenerator(sampling_weights)
Draw a random int in [0, n] according to n sampling weights.


class d2l.mxnet.Residual(num_channels, use_1x1conv=False, strides=1, **kwargs)
The Residual block of ResNet. 


forward(X) 
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.SNLIDataset(dataset, num_steps, vocab=None)
A customized dataset to load the SNLI dataset.


class d2l.mxnet.Seq2SeqEncoder(vocab_size, embed_size, num_hiddens, num_layers,
dropout=0, **kwargs)
The RNN encoder for sequence to sequence learning.


forward(X, *args)
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.SeqDataLoader(batch_size, num_steps, use_random_iter, max_tokens)
An iterator to load sequence data.


class d2l.mxnet.Timer
Record multiple running times.


avg()
Return the average time.


cumsum()
Return the accumulated time.


start()
Start the timer.


stop()
Stop the timer and record the time in a list.


sum()
Return the sum of time.
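
The five methods listed above fit together as in the following sketch, written with the standard library only. The attribute names times and tik and the use of time.perf_counter are assumptions for illustration, not a statement about the package's source.

```python
import itertools
import time


class Timer:
    """Record multiple running times (illustrative sketch)."""

    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        # Remember when the current measurement began.
        self.tik = time.perf_counter()

    def stop(self):
        # Record the elapsed time in the list and return it.
        self.times.append(time.perf_counter() - self.tik)
        return self.times[-1]

    def avg(self):
        return sum(self.times) / len(self.times)

    def sum(self):
        return sum(self.times)

    def cumsum(self):
        return list(itertools.accumulate(self.times))


timer = Timer()
timer.stop()                      # one real measurement
timer.times = [1.0, 2.0, 3.0]     # fixed values to make the statistics easy to read
stats = (timer.sum(), timer.avg(), timer.cumsum())
```

Inside the methods, sum refers to the built-in function, because names defined in a class body are not visible from method bodies without the self prefix.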


class d2l.mxnet.TokenEmbedding(embedding_name)
Token Embedding.


class d2l.mxnet.TransformerEncoder(vocab_size, num_hiddens, ffn_num_hiddens,
num_heads, num_layers, dropout, use_bias=False,
**kwargs)


forward(X, valid_lens, *args) 
Overrides to implement forward computation using NDArray. Only accepts positional 
arguments. 


*args [list of NDArray] Input tensors. 


class d2l.mxnet.VOCSegDataset(is_train, crop_size, voc_dir)
A customized dataset to load VOC dataset.


filter(imgs)
Returns a new dataset with samples filtered by the filter function fn.


Note that if the Dataset is the result of a lazily transformed one with
transform(lazy=False), the filter is eagerly applied to the transformed samples without
materializing the transformed result. That is, the transformation will be applied again
whenever a sample is retrieved after filter().


fn [callable] A filter function that takes a sample as input and returns a boolean. Samples
that return False are discarded.


Dataset The filtered dataset. 


class d2l.mxnet.Vocab(tokens=None, min_freq=0, reserved_tokens=None)
Vocabulary for text. 


d2l.mxnet.abs(x, out=None, **kwargs)
Calculate the absolute value element-wise. 


x [ndarray or scalar] Input array. 


out [ndarray or None, optional] A location into which the result is stored. If provided, it
must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated
array is returned.


absolute [ndarray] An ndarray containing the absolute value of each element in x. This is a
scalar if x is a scalar.


>>> x = np.array([-1.2, 1.2]) 
>>> np.abs(x) 
array([1.2, 1.2]) 


d2l.mxnet.accuracy(y_hat, y)
Compute the number of correct predictions.


d2l.mxnet.arange(start, stop=None, step=1, dtype=None, ctx=None)
Return evenly spaced values within a given interval.

Values are generated within the half-open interval [start, stop) (in other words, the inter- 
val including start but excluding stop). For integer arguments the function is equivalent to 
the Python built-in range function, but returns an ndarray rather than a list. 


start [number, optional] Start of interval. The interval includes this value. The default start 
value is 0. 


stop [number] End of interval. The interval does not include this value, except in some cases 
where step is not an integer and floating point round-off affects the length of out. 


step [number, optional] Spacing between values. For any output out, this is the distance
between two adjacent values, out[i+1] - out[i]. The default step size is 1. If step is
specified as a position argument, start must also be given.


dtype [dtype] The type of the output array. The default is float32. 


arange [ndarray] Array of evenly spaced values. 


For floating point arguments, the length of the result is ceil((stop - start)/step).
Because of floating point overflow, this rule may result in the last element of out being
greater than stop.


>>> np.arange(3)
array([0., 1., 2.])


>>> np.arange(3.0)
array([0., 1., 2.])


>>> np.arange(3,7)
array([3., 4., 5., 6.])


>>> np.arange(3,7,2)
array([3., 5.])
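
The length rule for floating point arguments can be checked with a short sketch. It uses the reference NumPy package rather than the MXNet namespace documented here, on the assumption that both follow the same rule; the values 0.0, 1.0, and 0.3 are arbitrary.

```python
import math

import numpy as np

start, stop, step = 0.0, 1.0, 0.3
values = np.arange(start, stop, step)

# The documented rule: the length is ceil((stop - start) / step).
expected_len = math.ceil((stop - start) / step)
lengths_match = len(values) == expected_len
```

When the exact number of points matters more than the step size, linspace is usually the safer choice, because its length is given directly.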


d2l.mxnet.bbox_to_rect(bbox, color)
Convert bounding box to matplotlib format.


d2l.mxnet.bleu(pred_seq, label_seq, k)
Compute the BLEU. 


d2l.mxnet.box_center_to_corner(boxes)
Convert from (center, width, height) to (upper_left, bottom_right) 


d2l.mxnet.box_corner_to_center(boxes)
Convert from (upper_left, bottom_right) to (center, width, height) 
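
The two conversions are inverses of each other. The sketch below illustrates them with the reference NumPy package instead of MXNet arrays; the (x1, y1, x2, y2) and (cx, cy, w, h) column orders and the sample coordinates are assumptions made for the example.

```python
import numpy as np


def box_corner_to_center(boxes):
    """(x1, y1, x2, y2) rows -> (cx, cy, w, h) rows."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = x2 - x1, y2 - y1
    return np.stack((cx, cy, w, h), axis=-1)


def box_center_to_corner(boxes):
    """(cx, cy, w, h) rows -> (x1, y1, x2, y2) rows."""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1, y1 = cx - 0.5 * w, cy - 0.5 * h
    x2, y2 = cx + 0.5 * w, cy + 0.5 * h
    return np.stack((x1, y1, x2, y2), axis=-1)


boxes = np.array([[60.0, 45.0, 378.0, 516.0],
                  [400.0, 112.0, 655.0, 493.0]])
centers = box_corner_to_center(boxes)
round_trip = box_center_to_corner(centers)
```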


d2l.mxnet.box_iou(boxes1, boxes2)


Compute IOU between two sets of boxes of shape (N,4) and (M,4). 
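
As a rough illustration of the pairwise computation, the following sketch uses the reference NumPy package and assumes boxes are given as (x1, y1, x2, y2) corner coordinates. It returns an (N, M) matrix of intersection-over-union values and is not taken from the package's source.

```python
import numpy as np


def box_iou(boxes1, boxes2):
    """Pairwise IoU between boxes of shape (N, 4) and (M, 4)."""
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    # Broadcast to (N, M, 2) to get the corners of every pairwise intersection.
    upper_left = np.maximum(boxes1[:, None, :2], boxes2[None, :, :2])
    lower_right = np.minimum(boxes1[:, None, 2:], boxes2[None, :, 2:])
    # Disjoint boxes yield negative extents, which are clipped to zero.
    wh = np.clip(lower_right - upper_left, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    union = area1[:, None] + area2[None, :] - inter
    return inter / union


a = np.array([[0.0, 0.0, 2.0, 2.0]])
b = np.array([[1.0, 1.0, 3.0, 3.0],   # partial overlap
              [0.0, 0.0, 2.0, 2.0],   # identical box
              [5.0, 5.0, 6.0, 6.0]])  # disjoint box
iou = box_iou(a, b)
```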


d2l.mxnet.build_array_nmt(lines, vocab, num_steps)
Transform text sequences of machine translation into minibatches. 


d2l.mxnet.build_colormap2label()
Build an RGB color to label mapping for segmentation.


d2l.mxnet.concat(seq, axis=0, out=None)
Join a sequence of arrays along an existing axis.


a1, a2, ... [sequence of array_like] The arrays must have the same shape, except in the
dimension corresponding to axis (the first, by default).


axis [int, optional] The axis along which the arrays will be joined. If axis is None, arrays are 
flattened before use. Default is 0. 


out [ndarray, optional] If provided, the destination to place the result. The shape must be 
correct, matching that of what concatenate would have returned if no out argument 
were specified. 


res [ndarray] The concatenated array. 


split : Split array into a list of multiple sub-arrays of equal size.
hsplit : Split array into multiple sub-arrays horizontally (column wise).
vsplit : Split array into multiple sub-arrays vertically (row wise).
dsplit : Split array into multiple sub-arrays along the 3rd axis (depth).
stack : Stack a sequence of arrays along a new axis.
hstack : Stack arrays in sequence horizontally (column wise).
vstack : Stack arrays in sequence vertically (row wise).
dstack : Stack arrays in sequence depth wise (along third dimension).


>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[5, 6]])
>>> np.concatenate((a, b), axis=0)
array([[1., 2.],
       [3., 4.],
       [5., 6.]])


>>> np.concatenate((a, b.T), axis=1)
array([[1., 2., 5.],
       [3., 4., 6.]])


>>> np.concatenate((a, b), axis=None)
array([1., 2., 3., 4., 5., 6.])

d2l.mxnet.copyfile(filename, target_dir)
Copy a file into a target directory. 


d2l.mxnet.corr2d(X, K)
Compute 2D cross-correlation. 
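
For readers who want to see the operation spelled out, here is a minimal sketch of 2D cross-correlation without padding or stride, written with the reference NumPy package. The explicit double loop favors clarity over speed, and the input values are arbitrary.

```python
import numpy as np


def corr2d(X, K):
    """Slide kernel K over input X and sum the elementwise products."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y


X = np.arange(9.0).reshape(3, 3)
K = np.array([[0.0, 1.0], [2.0, 3.0]])
Y = corr2d(X, K)
```

Unlike a true convolution, the kernel is not flipped before it is applied.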


d2l.mxnet.cos(x, out=None, **kwargs)
Cosine, element-wise.


x [ndarray or scalar] Angle, in radians (2π rad equals 360 degrees).


out [ndarray or None] A location into which the result is stored. If provided, it must have 
a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array 
is returned. The dtype of the output is the same as that of the input if the input is an 
ndarray. 


y [ndarray or scalar] The corresponding cosine values. This is a scalar if x is a scalar.


This function only supports input type of float.


>>> np.cos(np.array([0, np.pi/2, np.pi]))
array([ 1.000000e+00, -4.371139e-08, -1.000000e+00])
>>> # Example of providing the optional output parameter
>>> out1 = np.array([0], dtype='f')
>>> out2 = np.cos(np.array([0.1]), out1)
>>> out2 is out1
True


d2l.mxnet.cosh(x, out=None, **kwargs)
Hyperbolic cosine, element-wise. Equivalent to 1/2 * (np.exp(x) + np.exp(-x)) and
np.cos(1j*x).


x [ndarray or scalar] Input array or scalar. 


out [ndarray or None] A location into which the result is stored. If provided, it must have 
a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array 
is returned. The dtype of the output is the same as that of the input if the input is an 
ndarray. 


y [ndarray or scalar] The corresponding hyperbolic cosine values. This is a scalar if x is a
scalar.


This function only supports input type of float. 


>>> np.cosh(0) 
1.0 


d2l.mxnet.count_corpus(tokens)
Count token frequencies. 


class d2l.mxnet.defaultdict
defaultdict(default_factory[, ...]) -> dict with default factory


The default factory is called without arguments to produce a new value when a key is not 
present, in __getitem__ only. A defaultdict compares equal to a dict with the same items. 
All remaining arguments are treated the same as if they were passed to the dict constructor, 
including keyword arguments. 


copy() -> a shallow copy of D.


default_factory 
Factory for default value called by __missing__(). 
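
A short example may make the behavior of the default factory more concrete. It uses collections.defaultdict from the Python standard library, which is the class re-exported here; the token list is made up for illustration.

```python
from collections import defaultdict

counts = defaultdict(int)
for token in ["the", "time", "the"]:
    # A missing key is first filled with int(), i.e. 0, by __getitem__.
    counts[token] += 1

# .get() bypasses __getitem__, so the factory is not called and no key is added.
probe = counts.get("machine")
key_was_added = "machine" in counts

# A defaultdict compares equal to a plain dict with the same items.
same_as_dict = counts == {"the": 2, "time": 1}
```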


d2l.mxnet.download(name, cache_dir='../data')
Download a file inserted into DATA_HUB, return the local filename.


d2l.mxnet.download_all()
Download all files in the DATA_HUB.


d2l.mxnet.download_extract(name, folder=None)
Download and extract a zip/tar file.


d2l.mxnet.evaluate_accuracy(net, data_iter)
Compute the accuracy for a model on a dataset.


d2l.mxnet.evaluate_accuracy_gpu(net, data_iter, device=None)
Compute the accuracy for a model on a dataset using a GPU.


d2l.mxnet.evaluate_loss(net, data_iter, loss)
Evaluate the loss of a model on the given dataset.


d2l.mxnet.exp(x, out=None, **kwargs)
Calculate the exponential of all elements in the input array. 


x [ndarray or scalar] Input values. 


out [ndarray or None, optional] A location into which the result is stored. If provided, it 
must have a shape that the inputs broadcast to. If not provided or None, a freshly- 
allocated array is returned. 


out [ndarray or scalar] Output array, element-wise exponential of x. This is a scalar if x is a 
scalar. 


>>> np.exp(1) 

2.718281828459045 

>>> x = np.array([-1, 1, -2, 2]) 

>>> np.exp(x) 

array([0.36787945, 2.7182817 , 0.13533528, 7.389056  ])


d2l.mxnet.eye(N, M=None, k=0, dtype=<class 'numpy.float32'>, **kwargs)
Return a 2-D array with ones on the diagonal and zeros elsewhere. 


N [int] Number of rows in the output. 
M [int, optional] Number of columns in the output. If None, defaults to N. 


k [int, optional] Index of the diagonal: 0 (the default) refers to the main diagonal, a positive 
value refers to an upper diagonal, and a negative value to a lower diagonal. 


dtype [data-type, optional] Data-type of the returned array. 


I [ndarray of shape (N,M)] An array where all elements are equal to zero, except for the k-th
diagonal, whose values are equal to one.


>>> np.eye(2, dtype=int)
array([[1, 0],
       [0, 1]], dtype=int64)
>>> np.eye(3, k=1)
array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])


class d2l.mxnet.float32 
Single-precision floating-point number type, compatible with C float. Character code: 'f'.
Canonical name: np.single. Alias on this platform: np.float32: 32-bit-precision floating-point
number type: sign bit, 8 bits exponent, 23 bits mantissa.


as_integer_ratio() 
Return a pair of integers, whose ratio is exactly equal to the original floating point number,
and with a positive denominator. Raise OverflowError on infinities and a ValueError
on NaNs.


>>> np.single(10.0).as_integer_ratio()
(10, 1)
>>> np.single(0.0).as_integer_ratio()
(0, 1)
>>> np.single(-.25).as_integer_ratio()
(-1, 4)


d2l.mxnet.get_dataloader_workers()
Use 4 processes to read the data except for Windows.


d2l.mxnet.get_fashion_mnist_labels(labels)
Return text labels for the Fashion-MNIST dataset.


d2l.mxnet.grad_clipping(net, theta)
Clip the gradient.


class d2l.mxnet.int32
Signed integer type, compatible with C int. Character code: 'i'. Canonical name: np.intc.
Alias on this platform: np.int32: 32-bit signed integer (-2147483648 to 2147483647).


d2l.mxnet.linreg(X, w, b)
The linear regression model.


d2l.mxnet.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0,
ctx=None)
Return evenly spaced numbers over a specified interval. 


Returns num evenly spaced samples, calculated over the interval [start, stop]. The endpoint 
of the interval can optionally be excluded. 


start [real number] The starting value of the sequence. 


stop [real number] The end value of the sequence, unless endpoint is set to False. In that 
case, the sequence consists of all but the last of num + 1 evenly spaced samples, so that 
stop is excluded. Note that the step size changes when endpoint is False. 


num [int, optional] Number of samples to generate. Default is 50. Must be non-negative. 


endpoint [bool, optional] If True, stop is the last sample. Otherwise, it is not included. De- 
fault is True. 


retstep [bool, optional] If True, return (samples, step), where step is the spacing between 
samples. 


dtype [dtype, optional] The type of the output array. If dtype is not given, infer the data type 
from the other input arguments. 


axis [int, optional] The axis in the result to store the samples. Relevant only if start or stop 
are array-like. By default (0), the samples will be along a new axis inserted at the be- 
ginning. Use -1 to get an axis at the end. 


samples [ndarray] There are num equally spaced samples in the closed interval [start, stop] 
or the half-open interval [start, stop) (depending on whether endpoint is True or False). 


step [float, optional] Only returned if retstep is True Size of spacing between samples. 


arange : Similar to linspace, but uses a step size (instead of the number of samples).


>>> np.linspace(2.0, 3.0, num=5)
array([2.  , 2.25, 2.5 , 2.75, 3.  ])
>>> np.linspace(2.0, 3.0, num=5, endpoint=False)
array([2. , 2.2, 2.4, 2.6, 2.8])
>>> np.linspace(2.0, 3.0, num=5, retstep=True)
(array([2.  , 2.25, 2.5 , 2.75, 3.  ]), 0.25)


Graphical illustration:


>>> import matplotlib.pyplot as plt
>>> N = 8
>>> y = np.zeros(N)
>>> x1 = np.linspace(0, 10, N, endpoint=True)
>>> x2 = np.linspace(0, 10, N, endpoint=False)
>>> plt.plot(x1.asnumpy(), y.asnumpy(), 'o')
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.plot(x2.asnumpy(), (y + 0.5).asnumpy(), 'o')
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.ylim([-0.5, 1])
(-0.5, 1)
>>> plt.show()


This function differs from the original numpy.linspace [283] in the following aspects:
* start and stop do not support list, numpy ndarray and mxnet ndarray
* axis could only be 0


* There could be an additional ctx argument to specify the device, e.g. the i-th GPU. 
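
How the step size changes with endpoint can be seen in the following sketch. It uses the reference NumPy package rather than the MXNet namespace, on the assumption that both agree for scalar start and stop.

```python
import numpy as np

# Closed interval [2, 3]: five samples, so the step is (3 - 2) / (5 - 1).
closed, step_closed = np.linspace(2.0, 3.0, num=5, retstep=True)

# Half-open interval [2, 3): five samples, so the step is (3 - 2) / 5.
half_open, step_open = np.linspace(2.0, 3.0, num=5, endpoint=False, retstep=True)
```

With endpoint=False the last sample falls one step short of stop, which is why the two results share only their first element.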


d2l.mxnet.load_array(data_arrays, batch_size, is_train=True)
Construct a Gluon data iterator.


d2l.mxnet.load_corpus_time_machine(max_tokens=-1)
Return token indices and the vocabulary of the time machine dataset.


d2l.mxnet.load_data_bananas(batch_size)
Load the bananas dataset.


d2l.mxnet.load_data_fashion_mnist(batch_size, resize=None)
Download the Fashion-MNIST dataset and then load it into memory.


d2l.mxnet.load_data_nmt(batch_size, num_steps, num_examples=600)
Return the iterator and the vocabularies of the translation dataset.


d2l.mxnet.load_data_snli(batch_size, num_steps=50)
Download the SNLI dataset and return data iterators and vocabulary.


d2l.mxnet.load_data_time_machine(batch_size, num_steps, use_random_iter=False,
max_tokens=10000)
Return the iterator and the vocabulary of the time machine dataset.


d2l.mxnet.load_data_voc(batch_size, crop_size)
Download and load the VOC2012 semantic dataset.





283 https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html


d2l.mxnet.log(x, out=None, **kwargs)
Natural logarithm, element-wise. The natural logarithm log is the inverse of the exponential 
function, so that log(exp(x)) = x. The natural logarithm is logarithm in base e. 


x [ndarray] Input value. Elements must be of real value. 


out [ndarray or None, optional] A location into which the result is stored. If provided, it 
must have the same shape and dtype as input ndarray. If not provided or None, a freshly- 
allocated array is returned. 


y [ndarray] The natural logarithm of x, element-wise. This is a scalar if x is a scalar. 


Currently only supports data of real values and inf as input. Returns data of real value, inf,
-inf and nan according to the input. This function differs from the original numpy.log [284] in
the following aspects:
* Does not support complex numbers for now.
* Input type does not support Python native iterables (list, tuple, ...).
* out param: cannot perform auto broadcasting. The out ndarray’s shape must be the same as the
  expected output.
* out param: cannot perform auto type cast. The out ndarray’s dtype must be the same as the
  expected output.
* out param does not support the scalar input case.


>>> a = np.array([1, np.exp(1), np.exp(2), 0], dtype=np.float64)
>>> np.log(a)
array([  0.,   1.,   2., -inf], dtype=float64)
>>> # Using the default float32 dtype leads to slightly different behavior
>>> a = np.array([1, np.exp(1), np.exp(2), 0])
>>> np.log(a)
array([  0.,  0.99999994,   2., -inf])
>>> np.log(1)
0.0


d2l.mxnet.masked_softmax(X, valid_lens)
Perform softmax operation by masking elements on the last axis. 
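
One way to read this description is sketched below with the reference NumPy package. The sketch assumes X has shape (batch, num_queries, num_keys) and valid_lens is a 1-D integer array with one length (at least 1) per batch element; the fill value -1e6 is a common choice, not a documented constant of this function.

```python
import numpy as np


def masked_softmax(X, valid_lens):
    """Softmax over the last axis, ignoring positions at or beyond valid_lens."""
    positions = np.arange(X.shape[-1])
    # mask[b, q, k] is True when key position k is valid for batch element b.
    mask = positions[None, None, :] < valid_lens[:, None, None]
    # Masked scores get a large negative value, so exp() maps them to zero.
    scores = np.where(mask, X, -1e6)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


X = np.zeros((2, 2, 4))
probs = masked_softmax(X, np.array([2, 3]))
```

With all-zero scores, the valid positions in each row share the probability mass equally.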


d2l.mxnet.match_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5)
Assign ground-truth bounding boxes to anchor boxes similar to them. 


d2l.mxnet.matmul(a, b, out=None)
Dot product of two arrays. Specifically,


* If both a and b are 1-D arrays, it is the inner product of vectors.
* If both a and b are 2-D arrays, it is matrix multiplication.
* If either a or b is 0-D (scalar), it is equivalent to multiply(), and using np.multiply(a,
  b) or a * b is preferred.
* If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b.
* If a is an N-D array and b is a 2-D array, it is a sum product over the last axis of a and the
  second-to-last axis of b:


dot(a, b)[i,j,k] = sum(a[i,j,:] * b[:,k])


a [ndarray] First argument.


b [ndarray] Second argument.


284 https://docs.scipy.org/doc/numpy/reference/generated/numpy.log.html


out [ndarray, optional] Output argument. It must have the same shape and type as the ex- 
pected output. 


output [ndarray] Returns the dot product of a and b. If a and b are both scalars or both 1-D
arrays then a scalar is returned; otherwise an array is returned. If out is given, then it
is returned.


>>> a = np.array(3)
>>> b = np.array(4)
>>> np.dot(a, b)
array(12.)


For 2-D arrays it is the matrix product:


>>> a = np.array([[1, 0], [0, 1]])
>>> b = np.array([[4, 1], [2, 2]])
>>> np.dot(a, b)
array([[4., 1.],
       [2., 2.]])


>>> a = np.arange(3*4*5*6).reshape((3,4,5,6))
>>> b = np.arange(5*6)[::-1].reshape((6,5))
>>> np.dot(a, b)[2,3,2,2]
array(29884.)
>>> np.sum(a[2,3,2,:] * b[:,2])
array(29884.)
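
The sum-product rule for an N-D array times a 2-D array can be reproduced with the reference NumPy package, assuming it agrees with the MXNet version on this case. The sketch repeats the arrays from the example above and also records the shape of the full result.

```python
import numpy as np

a = np.arange(3 * 4 * 5 * 6).reshape((3, 4, 5, 6))
b = np.arange(5 * 6)[::-1].reshape((6, 5))

# The last axis of a (length 6) is contracted with the second-to-last axis of b.
full = np.dot(a, b)
entry = full[2, 3, 2, 2]

# The same entry computed by hand from the rule dot(a, b)[i,j,k,m] = sum(a[i,j,k,:] * b[:,m]).
manual = np.sum(a[2, 3, 2, :] * b[:, 2])
```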


d2l.mxnet.meshgrid(*x1, **kwargs)
Return coordinate matrices from coordinate vectors. 


Make N-D coordinate arrays for vectorized evaluations of N-D scalar/vector fields over N-D 
grids, given one-dimensional coordinate arrays x1, x2,..., xn. 


x1, x2,..., xn [ndarrays] 1-D arrays representing the coordinates of a grid. 


indexing [{‘xy’, ‘ij’}, optional] Cartesian (‘xy’, default) or matrix (‘ij’) indexing of output. See
Notes for more details.


sparse [bool, optional] If True a sparse grid is returned in order to conserve memory. De- 
fault is False. Please note that sparse=True is currently not supported. 


copy [bool, optional] If False, a view into the original arrays are returned in order to con- 
serve memory. Default is True. Please note that copy=False is currently not supported. 


X1, X2,..., XN [ndarray] For vectors x1, x2,..., xn with lengths Ni=len(xi), return (N1, N2,
N3,...Nn) shaped arrays if indexing=’ij’ or (N2, N1, N3,...Nn) shaped arrays if
indexing=’xy’ with the elements of xi repeated to fill the matrix along the first dimension for
x1, the second for x2 and so on.


This function supports both indexing conventions through the indexing keyword argument.
Giving the string ‘ij’ returns a meshgrid with matrix indexing, while ‘xy’ returns a meshgrid
with Cartesian indexing. In the 2-D case with inputs of length M and N, the outputs are of
shape (N, M) for ‘xy’ indexing and (M, N) for ‘ij’ indexing. In the 3-D case with inputs of length
M, N and P, outputs are of shape (N, M, P) for ‘xy’ indexing and (M, N, P) for ‘ij’ indexing.
The difference is illustrated by the following code snippet:


xv, yv = np.meshgrid(x, y, sparse=False, indexing='ij')
for i in range(nx):
    for j in range(ny):
        # treat xv[i,j], yv[i,j]


xv, yv = np.meshgrid(x, y, sparse=False, indexing='xy')
for i in range(nx):
    for j in range(ny):
        # treat xv[j,i], yv[j,i]


In the 1-D and 0-D case, the indexing and sparse keywords have no effect.
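
The shape difference between the two conventions is easy to verify in a small sketch. It uses the reference NumPy package, assuming the MXNet version behaves the same for these arguments; the input lengths 3 and 2 are arbitrary.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 3)   # length M = 3
y = np.linspace(0.0, 1.0, 2)   # length N = 2

# Matrix indexing: outputs have shape (M, N), and xv_ij[i, j] == x[i].
xv_ij, yv_ij = np.meshgrid(x, y, indexing='ij')

# Cartesian indexing: outputs have shape (N, M), and xv_xy[j, i] == x[i].
xv_xy, yv_xy = np.meshgrid(x, y, indexing='xy')
```

In the 2-D case the two results are transposes of each other, so the choice only affects which axis comes first.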


d2l.mxnet.normal(loc=0.0, scale=1.0, size=None, dtype=None, ctx=None, out=None)
Draw random samples from a normal (Gaussian) distribution. 


Samples are distributed according to a normal distribution parametrized by loc (mean) and 
scale (standard deviation). 


loc [float, optional] Mean (centre) of the distribution. 
scale [float, optional] Standard deviation (spread or “width”) of the distribution. 


size [int or tuple of ints, optional] Output shape. If the given shape is, e.g., (m, n, k), then
m * n * k samples are drawn. If size is None (default), a scalar tensor containing a single
value is returned if loc and scale are both scalars. Otherwise, np.broadcast(loc,
scale).size samples are drawn.


dtype [{‘float16’, ‘float32’, ‘float64’}, optional] Data type of output samples. Default is ‘float32’.
ctx [Context, optional] Device context of output, default is current context. 


out [ndarray, optional] Store output to an existing ndarray. 
out [ndarray] Drawn samples from the parameterized normal distribution. 


The probability density for the Gaussian distribution is


$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}} \tag{19.7.1}$$


where μ is the mean and σ the standard deviation. The square of the standard deviation, σ²,
is called the variance.


The function has its peak at the mean, and its “spread” increases with the standard deviation
(the function reaches 0.607 times its maximum at μ + σ and μ − σ [2]). This implies that
numpy.random.normal is more likely to return samples lying close to the mean, rather than
those far away.


>>> mu, sigma = 0, 0.1 # mean and standard deviation 
>>> s = np.random.normal(mu, sigma, 1000) 





[2] P. R. Peebles Jr., “Central Limit Theorem” in “Probability, Random Variables and Random Signal Principles”, 4th
ed., 2001, pp. 51, 51, 125.


Verify the mean:


>>> np.abs(mu - np.mean(s)) < 0.01 
array(True) 
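
The 0.607 figure quoted in the description of the density follows from evaluating the density one standard deviation away from the mean, where the exponent equals -1/2. The sketch below checks this with the standard library only; `normal_pdf` is a hypothetical helper written for this illustration and is not part of d2l, MXNet, or NumPy.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

mu, sigma = 0.0, 0.1
peak = normal_pdf(mu, mu, sigma)  # maximum of the density, attained at the mean

# Ratio of the density one standard deviation from the mean to its peak
ratio_plus = normal_pdf(mu + sigma, mu, sigma) / peak
ratio_minus = normal_pdf(mu - sigma, mu, sigma) / peak

print(round(ratio_plus, 3), round(ratio_minus, 3))  # 0.607 0.607
```

The ratio is exp(-1/2) regardless of the values chosen for the mean and standard deviation.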


d2l.mxnet.ones(shape, dtype=<class 'numpy.float32'>, order='C', ctx=None)
Return a new array of given shape and type, filled with ones. This function currently only 
supports storing multi-dimensional data in row-major (C-style). 


shape [int or tuple of int] The shape of the new array.


dtype [str or numpy.dtype, optional] An optional value type. Default is numpy.float32. Note 
that this behavior is different from NumPy’s ones function where float64 is the default 
value, because float32 is considered as the default data type in deep learning. 


order [{‘C’}, optional, default: ‘C’] How to store multi-dimensional data in memory, cur- 
rently only row-major (C-style) is supported. 


ctx [Context, optional] An optional device context (default is the current default context). 


out [ndarray] Array of ones with the given shape, dtype, and ctx.


>>> np.ones(5)
array([1., 1., 1., 1., 1.])


>>> np.ones((5,), dtype=int)
array([1, 1, 1, 1, 1], dtype=int64)


>>> np.ones((2, 1))
array([[1.],
       [1.]])


>>> s = (2,2)
>>> np.ones(s)
array([[1., 1.],
       [1., 1.]])


d2l.mxnet.plot(X, Y=None, xlabel=None, ylabel=None, legend=None, xlim=None, ylim=None,
xscale='linear', yscale='linear', fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5),
axes=None)
Plot data points.


d2l.mxnet.predict_ch3(net, test_iter, n=6)
Predict labels (defined in Chapter 3).


d2l.mxnet.predict_ch8(prefix, num_preds, net, vocab, device)
Generate new characters following the prefix.


d2l.mxnet.predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps, device,
save_attention_weights=False)
Predict for sequence to sequence.


d2l.mxnet.preprocess_nmt(text)
Preprocess the English-French dataset.


d2l.mxnet.rand(*size, **kwargs)
Random values in a given shape.


Create an array of the given shape and populate it with random samples from a uniform
distribution over [0, 1).


d0, d1, ..., dn [int, optional] The dimensions of the returned array, should be all positive. If no
argument is given a single Python float is returned.


out [ndarray] Random values.


>>> np.random.rand(3,2)
array([[ 0.14022471,  0.96360618],  #random
       [ 0.37601032,  0.25528411],  #random
       [ 0.49313049,  0.94909878]]) #random


d2l.mxnet.read_csv_labels(fname)
Read fname to return a name to label dictionary.


d2l.mxnet.read_data_bananas(is_train=True)
Read the bananas dataset images and labels.


d2l.mxnet.read_data_nmt()
Load the English-French dataset.


d2l.mxnet.read_snli(data_dir, is_train)
Read the SNLI dataset into premises, hypotheses, and labels.


d2l.mxnet.read_time_machine()
Load the time machine dataset into a list of text lines.


d2l.mxnet.read_voc_images(voc_dir, is_train=True)
Read all VOC feature and label images.


d2l.mxnet.resnet18(num_classes)
A slightly modified ResNet-18 model.


d2l.mxnet.seq_data_iter_random(corpus, batch_size, num_steps)
Generate a minibatch of subsequences using random sampling.


d2l.mxnet.seq_data_iter_sequential(corpus, batch_size, num_steps)
Generate a minibatch of subsequences using sequential partitioning.


d2l.mxnet.set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
Set the axes for matplotlib.


d2l.mxnet.set_figsize(figsize=(3.5, 2.5))
Set the figure size for matplotlib.


d2l.mxnet.sgd(params, lr, batch_size)
Minibatch stochastic gradient descent. 
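
The listing gives only the signature, so the following is a sketch of what a minibatch SGD update of this kind typically does, not the library's code. It assumes plain NumPy arrays, gradients that were summed over the minibatch (hence the division by `batch_size`), and gradients passed explicitly through a hypothetical `grads` argument rather than read from each parameter.

```python
import numpy as np

def sgd_step(params, grads, lr, batch_size):
    """Update each parameter in place: param <- param - lr * grad / batch_size."""
    for param, grad in zip(params, grads):
        # In-place update so callers holding a reference see the new values
        param -= lr * grad / batch_size

# Toy parameters and (assumed) summed gradients for a minibatch of 4 examples
w = np.array([1.0, -2.0])
b = np.array([0.5])
grad_w = np.array([4.0, 8.0])
grad_b = np.array([2.0])

sgd_step([w, b], [grad_w, grad_b], lr=0.1, batch_size=4)
print(w, b)
```

Dividing by the batch size keeps the effective step size independent of how many examples contributed to the summed gradient.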


d2l.mxnet.show_bboxes(axes, bboxes, labels=None, colors=None)
Show bounding boxes.


d2l.mxnet.show_images(imgs, num_rows, num_cols, titles=None, scale=1.5)
Plot a list of images.


d2l.mxnet.show_trace_2d(f, results)
Show the trace of 2D variables during optimization. 


d2l.mxnet.sin(x, out=None, **kwargs)
Trigonometric sine, element-wise.


x [ndarray or scalar] Angle, in radians (2π rad equals 360 degrees).


out [ndarray or None] A location into which the result is stored. If provided, it must have 
a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array 
is returned. The dtype of the output is the same as that of the input if the input is an 
ndarray. 


y [ndarray or scalar] The sine of each element of x. This is a scalar if x is a scalar. 


This function only supports input type of float.


>>> np.sin(np.pi/2.)
1.0
>>> np.sin(np.array((0., 30., 45., 60., 90.)) * np.pi / 180.)
array([0.        , 0.5       , 0.70710677, 0.86602545, 1.        ])


d2l.mxnet.sinh(x, out=None, **kwargs)
Hyperbolic sine, element-wise. Equivalent to 1/2 * (np.exp(x) - np.exp(-x)) or -1j * 
np.sin(1j*x). 


x [ndarray or scalar] Input array or scalar. 


out [ndarray or None] A location into which the result is stored. If provided, it must have 
a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array 
is returned. The dtype of the output is the same as that of the input if the input is an 
ndarray. 


y [ndarray or scalar] The corresponding hyperbolic sine values. This is a scalar if x is a 
scalar. 


This function only supports input type of float.


>>> np.sinh(0)
0.0
>>> # Example of providing the optional output parameter
>>> out1 = np.array([0], dtype='f')
>>> out2 = np.sinh(np.array([0.1]), out1)
>>> out2 is out1
True
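
The two equivalent forms given in the description of the hyperbolic sine can be verified numerically. This sketch uses only Python's standard `math` and `cmath` modules (the complex form is evaluated with `cmath` because, as noted above, the MXNet function only supports float input); the test point `x = 0.75` is arbitrary.

```python
import cmath
import math

x = 0.75

# Exponential definition: 1/2 * (exp(x) - exp(-x))
by_definition = 0.5 * (math.exp(x) - math.exp(-x))

# Complex-sine form: -1j * sin(1j * x); the result is complex with a zero imaginary part
via_complex_sine = -1j * cmath.sin(1j * x)

print(by_definition)
print(via_complex_sine.real)
print(math.sinh(x))
```

All three printed values agree up to floating-point rounding.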


d2l.mxnet.split_batch(X, y, devices)
Split X and y into multiple devices.


d2l.mxnet.split_batch_multi_inputs(X, y, devices)
Split multi-input X and y into multiple devices.


d2l.mxnet.split_data_ml100k(data, num_users, num_items, split_mode='random', test_ratio=0.1)
Split the dataset in random mode or seq-aware mode.


d2l.mxnet.squared_loss(y_hat, y)
Squared loss. 
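
As the entry gives no formula, here is a minimal sketch of a per-example squared loss under two stated assumptions: the conventional factor of 1/2 is included, and the labels are reshaped to match the predictions. It uses plain NumPy rather than the MXNet arrays the library function operates on.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Per-example squared loss, (y_hat - y)^2 / 2, with y reshaped like y_hat."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

y_hat = np.array([[2.5], [0.0]])  # predictions, shape (2, 1)
y = np.array([3.0, 1.0])          # labels, shape (2,)

loss = squared_loss(y_hat, y)
print(loss.shape)  # (2, 1)
print(loss)
```

Reshaping the labels avoids accidental broadcasting of a (2, 1) array against a (2,) array, which would otherwise produce a (2, 2) result.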


d2l.mxnet.stack(arrays, axis=0, out=None)
Join a sequence of arrays along a new axis. The axis parameter specifies the index of the
new axis in the dimensions of the result. For example, if axis=0 it will be the first
dimension and if axis=-1 it will be the last dimension.


arrays [sequence of array_like] Each array must have the same shape. 
axis [int, optional] The axis in the result array along which the input arrays are stacked. 


out [ndarray, optional] If provided, the destination to place the result. The shape must be
correct, matching that of what stack would have returned if no out argument were specified.


stacked [ndarray] The stacked array has one more dimension than the input arrays. 


concatenate : Join a sequence of arrays along an existing axis.
split : Split array into a list of multiple sub-arrays of equal size.


>>> arrays = [np.random.rand(3, 4) for _ in range(10)] 
>>> np.stack(arrays, axis=0).shape 
(10, 3, 4)


>>> np.stack(arrays, axis=1).shape
(3, 10, 4)


>>> np.stack(arrays, axis=2).shape
(3, 4, 10)


>>> a = np.array([1, 2, 3])
>>> b = np.array([2, 3, 4])
>>> np.stack((a, b))
array([[1., 2., 3.],
       [2., 3., 4.]])


>>> np.stack((a, b), axis=-1)
array([[1., 2.],
       [2., 3.],
       [3., 4.]])
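
To make the contrast with the related functions concrete, the sketch below compares stacking along a new axis with concatenating along an existing one and splitting the result back. It assumes plain NumPy as a stand-in for the MXNet `np` module, so integer inputs stay integers here rather than becoming float32.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([2, 3, 4])

stacked = np.stack((a, b))               # new leading axis: shape (2, 3)
stacked_last = np.stack((a, b), axis=-1) # new trailing axis: shape (3, 2)
joined = np.concatenate((a, b))          # existing axis 0: shape (6,)
parts = np.split(joined, 2)              # list of two arrays of equal size

print(stacked.shape, stacked_last.shape, joined.shape)
print([p.tolist() for p in parts])
```

Stacking always adds one dimension, whereas concatenation keeps the number of dimensions and only lengthens one of them.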


d2l.mxnet.synthetic_data(w, b, num_examples)
Generate y = Xw + b + noise. 
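
A minimal sketch of such a generator follows, written with plain NumPy. The standard-normal features, the `noise_std` and `seed` parameters, and the column-vector label shape are assumptions made for this illustration; the entry above specifies only the relation y = Xw + b + noise.

```python
import numpy as np

def synthetic_data(w, b, num_examples, noise_std=0.01, seed=0):
    """Generate features X and labels y = Xw + b + Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0, size=(num_examples, len(w)))  # assumed feature distribution
    y = X @ w + b
    y += rng.normal(0.0, noise_std, size=y.shape)          # additive observation noise
    return X, y.reshape(-1, 1)

true_w = np.array([2.0, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
print(features.shape, labels.shape)  # (1000, 2) (1000, 1)
```

Because the true parameters are known, data generated this way is convenient for checking that a linear regression implementation recovers them.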


d2l.mxnet.tanh(x, out=None, **kwargs)
Compute hyperbolic tangent element-wise. Equivalent to np.sinh(x)/np.cosh(x).


x [ndarray or scalar] Input array.


out [ndarray or None] A location into which the result is stored. If provided, it must have
a shape that the inputs fill into. If not provided or None, a freshly-allocated array is
returned. The dtype of the output and input must be the same.


y [ndarray or scalar] The corresponding hyperbolic tangent values.


If out is provided, the function writes the result into it, and returns a reference to out. (See
Examples)


Input x does not support complex computation (like imaginary number):


>>> np.tanh(np.pi*1j)
TypeError: type <type 'complex'> not supported


>>> np.tanh(np.array([0, np.pi]))
array([0.       , 0.9962721])
>>> np.tanh(np.pi)
0.99627207622075
>>> # Example of providing the optional output parameter illustrating
>>> # that what is returned is a reference to said parameter
>>> out1 = np.array(1)
>>> out2 = np.tanh(np.array(0.1), out1)
>>> out2 is out1
True


d2l.mxnet.tensor(object, dtype=None, ctx=None)
Create an array. 


object [array_like or numpy.ndarray or mxnet.numpy.ndarray] An array, any object exposing 
the array interface, an object whose __array__ method returns an array, or any (nested) 
sequence. 


dtype [data-type, optional] The desired data-type for the array. Default is float32.
ctx [device context, optional] Device context on which the memory is allocated. Default is
mxnet.context.current_context().


out [ndarray] An array object satisfying the specified requirements.


>>> np.array([1, 2, 3])
array([1., 2., 3.])


>>> np.array([[1, 2], [3, 4]])
array([[1., 2.],
       [3., 4.]])


>>> np.array([[1, 0], [0, 1]], dtype=bool)
array([[ True, False],
       [False,  True]])


d2l.mxnet.tokenize(lines, token='word')
Split text lines into word or character tokens. 
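
The behaviour described here can be sketched in a few lines of plain Python. This is an illustrative version under the assumption that word tokens are obtained by whitespace splitting and character tokens by listing every character (spaces included); the error handling for other values of `token` is also an assumption.

```python
def tokenize(lines, token='word'):
    """Split each text line into a list of word or character tokens."""
    if token == 'word':
        return [line.split() for line in lines]
    if token == 'char':
        return [list(line) for line in lines]
    raise ValueError(f"unknown token type: {token}")

lines = ['the time machine', 'by h g wells']
print(tokenize(lines))
print(tokenize(['ab c'], token='char'))
```

The result is a list with one token list per input line, so line boundaries are preserved.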


d2l.mxnet.tokenize_nmt(text, num_examples=None)
Tokenize the English-French dataset.


d2l.mxnet.train_2d(trainer, steps=20)
Optimize a 2-dim objective function with a customized trainer.


d2l.mxnet.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
Train a model (defined in Chapter 3).


d2l.mxnet.train_ch6(net, train_iter, test_iter, num_epochs, lr, device=gpu(0))
Train a model with a GPU (defined in Chapter 6).


d2l.mxnet.train_ch8(net, train_iter, vocab, lr, num_epochs, device, use_random_iter=False)
Train a model (defined in Chapter 8).


d2l.mxnet.train_epoch_ch3(net, train_iter, loss, updater)
Train a model within one epoch (defined in Chapter 3).


d2l.mxnet.train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter)
Train a model within one epoch (defined in Chapter 8).


d2l.mxnet.train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device)
Train a model for sequence to sequence.


d2l.mxnet.transpose_output(X, num_heads)
Reverse the operation of transpose_qkv.


d2l.mxnet.truncate_pad(line, num_steps, padding_token)
Truncate or pad sequences. 
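
A minimal sketch of this operation, assuming `line` is a Python list of tokens and that padding is appended on the right:

```python
def truncate_pad(line, num_steps, padding_token):
    """Return a copy of line cut or right-padded to exactly num_steps tokens."""
    if len(line) > num_steps:
        return line[:num_steps]  # truncate
    return line + [padding_token] * (num_steps - len(line))  # pad

print(truncate_pad([5, 7, 9, 11], 3, 0))  # [5, 7, 9]
print(truncate_pad([5, 7], 4, 0))         # [5, 7, 0, 0]
```

Forcing every sequence to the same length is what allows a minibatch of sequences to be stored in a single rectangular array.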


d2l.mxnet.try_all_gpus()
Return all available GPUs, or [cpu()] if no GPU exists.


d2l.mxnet.try_gpu(i=0)
Return gpu(i) if it exists, otherwise return cpu().


d2l.mxnet.update_D(X, Z, net_D, net_G, loss, trainer_D)
Update discriminator.


d2l.mxnet.update_G(Z, net_D, net_G, loss, trainer_G)
Update generator.


d2l.mxnet.use_svg_display()
Use the svg format to display a plot in Jupyter.


d2l.mxnet.voc_label_indices(colormap, colormap2label)
Map an RGB color to a label.


d2l.mxnet.voc_rand_crop(feature, label, height, width)
Randomly crop for both feature and label images.


d2l.mxnet.zeros(shape, dtype=None, order='C', ctx=None)
Return a new array of given shape and type, filled with zeros. This function currently only 
supports storing multi-dimensional data in row-major (C-style). 


shape [int or tuple of int] The shape of the new array.


dtype [str or numpy.dtype, optional] An optional value type (default is numpy.float32). Note 
that this behavior is different from NumPy’s zeros function where float64 is the default 
value, because float32 is considered as the default data type in deep learning. 


order [{‘C’}, optional, default: ‘C’] How to store multi-dimensional data in memory, cur- 
rently only row-major (C-style) is supported. 


ctx [Context, optional] An optional device context (default is the current default context). 


out [ndarray] Array of zeros with the given shape, dtype, and ctx.


>>> np.zeros(5)
array([0., 0., 0., 0., 0.])


>>> np.zeros((5,), dtype=int)
array([0, 0, 0, 0, 0], dtype=int64)


>>> np.zeros((2, 1))
array([[0.],
       [0.]])